Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

9.2 Improving Scalar SSE and SSE2 Floating-Point Performance with MOVLPD and MOVLPS When Loading Data from Memory

Optimization

Use the MOVLPS and MOVLPD instructions to move scalar floating-point data into the XMM registers prior to addition, multiplication, or other scalar instructions.

Application

This optimization applies to:

32-bit software

64-bit software

Rationale—Single Precision

The MOVSS instruction is commonly used to move scalar single-precision floating-point data into the XMM registers prior to addition (ADDSS), multiplication (MULSS), or other scalar instructions. In addition to loading a 32-bit floating-point value into the low doubleword of the XMM register, the memory-source form of MOVSS clears the upper 96 bits of the register. Clearing part of the XMM register is an inefficiency that you can bypass by using the MOVLPS instruction. MOVLPS loads two single-precision floating-point values (64 bits) from memory into the lower half of the XMM register and leaves the upper 64 bits unchanged.
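The difference in register contents can be illustrated with a short C sketch. It is not part of the original guide. It assumes a compiler that provides the SSE intrinsics `_mm_load_ss` (typically compiled to MOVSS) and `_mm_loadl_pi` (typically compiled to MOVLPS), and falls back to plain C that mimics the same lane behavior elsewhere. The helper names `load_movss` and `load_movlps` and the macro `SKETCH_HAVE_SSE` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

/* Hypothetical helper: MOVSS-style load.
 * Result lanes: { src[0], 0, 0, 0 } (upper 96 bits cleared). */
static void load_movss(float reg[4], const float *src)
{
    assert(reg != NULL && src != NULL);
#if SKETCH_HAVE_SSE
    _mm_storeu_ps(reg, _mm_load_ss(src));
#else
    reg[0] = src[0];
    reg[1] = reg[2] = reg[3] = 0.0f;
#endif
}

/* Hypothetical helper: MOVLPS-style load.
 * reg holds the previous register contents on entry.
 * Result lanes: { src[0], src[1], old reg[2], old reg[3] }.
 * src must point to at least 8 readable bytes. */
static void load_movlps(float reg[4], const float *src)
{
    assert(reg != NULL && src != NULL);
#if SKETCH_HAVE_SSE
    __m128 r = _mm_loadu_ps(reg);
    r = _mm_loadl_pi(r, (const __m64 *)src);
    _mm_storeu_ps(reg, r);
#else
    reg[0] = src[0];
    reg[1] = src[1];
#endif
}
```

For a register that previously held four nonzero values, `load_movss` leaves only the low lane nonzero, while `load_movlps` overwrites the two low lanes and keeps the two high lanes.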

The latency of the MOVSS instruction is 3 cycles, whereas the latency of the MOVLPS instruction is 2 cycles. The AMD Athlon™ 64 and AMD Opteron™ processors can perform two 64-bit loads per clock cycle, so two 64-bit MOVLPS loads can be issued in the same cycle, assuming the data is 8-byte aligned. Two MOVSS loads can likewise be performed per cycle, but, unlike MOVLPS, MOVSS requires additional operations to clear the register, and these operations interfere with the MULSS and ADDSS instructions. In processor-limited, floating-point-intensive code, using MOVLPS rather than MOVSS to load single-precision scalar data from memory can result in significant performance increases.
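As a rough illustration of the substitution described above, the following C sketch computes the same scalar sum with both load styles. It is not taken from the guide. The function names and the `SKETCH_HAVE_SSE` macro are hypothetical, and the compiler is free to select different instructions than the intrinsics suggest, so inspect the generated assembly before drawing performance conclusions. The `_mm_setzero_ps()` call only gives the intrinsic an initialized input; hand-written assembly would not need it.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

/* Baseline: MOVSS-style loads followed by ADDSS. */
static float add_scalar_movss(const float *a, const float *b)
{
    assert(a != NULL && b != NULL);
#if SKETCH_HAVE_SSE
    __m128 x = _mm_load_ss(a);
    __m128 y = _mm_load_ss(b);
    return _mm_cvtss_f32(_mm_add_ss(x, y));
#else
    return a[0] + b[0];
#endif
}

/* Variant: MOVLPS-style loads followed by ADDSS.
 * Each pointer must reference at least 8 readable bytes,
 * because a 64-bit load reads a[0..1] and b[0..1].
 * Only the low lane takes part in the scalar add. */
static float add_scalar_movlps(const float *a, const float *b)
{
    assert(a != NULL && b != NULL);
#if SKETCH_HAVE_SSE
    __m128 x = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)a);
    __m128 y = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)b);
    return _mm_cvtss_f32(_mm_add_ss(x, y));
#else
    return a[0] + b[0];
#endif
}
```

Both functions return the same value; the second element of each input is loaded by the MOVLPS-style variant but ignored by the scalar add.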

Consider the following caveats when using the MOVLPS instruction:

When accessing 4-byte-aligned addresses that are not 8-byte aligned, MOVLPS loads take an additional cycle.

Because MOVLPS loads two floating-point values instead of one, accessing the last floating-point value in a single-precision array attempts to load 4 additional bytes of memory directly after the end of the array, which may cause an access violation. To avoid an access violation, either use MOVSS to access the last value in a single-precision array or pad the array by storing a dummy floating-point value at its end.
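The end-of-array caveat can be handled as in the following C sketch, which uses a MOVLPS-style load for every element except the last and a MOVSS-style load for the last one. This is a minimal illustration, not code from the guide. The function name `sum_scalar` and the macro `SKETCH_HAVE_SSE` are hypothetical, and the non-SSE branch is a plain C fallback.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

/* Sums n single-precision values one scalar at a time. */
static float sum_scalar(const float *v, size_t n)
{
    assert(v != NULL || n == 0);
#if SKETCH_HAVE_SSE
    __m128 acc = _mm_setzero_ps();
    __m128 x = _mm_setzero_ps();
    size_t i;

    /* For i < n - 1 the 64-bit load reads v[i] and v[i + 1],
     * both of which lie inside the array. */
    for (i = 0; i + 1 < n; ++i) {
        x = _mm_loadl_pi(x, (const __m64 *)&v[i]);
        acc = _mm_add_ss(acc, x);   /* adds the low lane only */
    }

    /* Last element: a 32-bit MOVSS-style load avoids touching
     * the 4 bytes that follow the array. */
    if (n > 0) {
        acc = _mm_add_ss(acc, _mm_load_ss(&v[n - 1]));
    }
    return _mm_cvtss_f32(acc);
#else
    float acc = 0.0f;
    size_t i;
    for (i = 0; i < n; ++i) {
        acc += v[i];
    }
    return acc;
#endif
}
```

The alternative named in the text, padding the array with one dummy element, would let the loop use the 64-bit load for every element at the cost of 4 extra bytes of storage.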


Optimizing with SIMD Instructions

Chapter 9
