Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

Use the Largest Possible Operand Size

Always move data using the largest operand size possible. For example, use REP MOVSD rather than REP MOVSW, and REP MOVSW rather than REP MOVSB. Use REP STOSD rather than REP STOSW, and REP STOSW rather than REP STOSB.

In 64-bit mode, a quadword data size is available and offers better performance (for example, REP MOVSQ and REP STOSQ).
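As a rough illustration of the advice above, the following C sketch moves the bulk of a buffer with REP MOVSQ and only the remaining 0 to 7 bytes with REP MOVSB. The function name copy_forward is hypothetical, the inline-assembly syntax assumes GCC or Clang on a 64-bit x86 target, and other targets fall back to memcpy. It assumes the buffers do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the guide): forward copy of n bytes
 * between non-overlapping buffers, using the largest operand size
 * for the bulk of the data. */
static void copy_forward(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
#if defined(__x86_64__) && defined(__GNUC__)
    size_t qwords = n / 8;  /* bulk: moved 8 bytes at a time */
    size_t tail = n % 8;    /* end case: 0..7 leftover bytes */

    /* CLD selects DF = 0 (increment); RDI, RSI and RCX are updated in place. */
    __asm__ volatile("cld\n\trep movsq"
                     : "+D"(dst), "+S"(src), "+c"(qwords)
                     :
                     : "memory", "cc");
    /* DF is still 0 here; copy the remaining bytes one at a time. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(tail)
                     :
                     : "memory", "cc");
#else
    /* Assumption: on other targets the library routine stands in. */
    memcpy(dst, src, n);
#endif
}
```

For a 20-byte buffer this issues two quadword moves followed by four byte moves, rather than twenty byte moves.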

Ensure DF = 0 (Increment)

Always make sure that DF = 0 (increment), the state that follows execution of CLD, when using REP MOVS and REP STOS. DF = 1 (decrement) is needed only for certain cases of REP MOVS in which the source and destination overlap.

While string instructions with DF = 1 (decrement) are slower, only the overhead part of the cycle equation is larger and not the throughput part. See Table 6 on page 167 for additional latency numbers.
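The sketch below shows one case where DF = 1 is actually required: the destination starts inside the source region, so a forward copy would overwrite source bytes before reading them. The name move_bytes is hypothetical, the inline assembly assumes GCC or Clang on a 64-bit x86 target, and other targets fall back to memmove. Every other case takes the DF = 0 path, and the decrementing path restores DF = 0 before returning.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the guide): byte move that tolerates
 * overlapping source and destination. */
static void move_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    if (n == 0)
        return;
#if defined(__x86_64__) && defined(__GNUC__)
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    if (d > s && d - s < n) {
        /* dst begins inside src: copy from the last byte downward. */
        unsigned char *dlast = dst + (n - 1);
        const unsigned char *slast = src + (n - 1);
        __asm__ volatile("std\n\trep movsb\n\tcld"  /* DF = 1, then back to 0 */
                         : "+D"(dlast), "+S"(slast), "+c"(n)
                         :
                         : "memory", "cc");
    } else {
        /* Common case: DF = 0 (increment). */
        __asm__ volatile("cld\n\trep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory", "cc");
    }
#else
    /* Assumption: on other targets the library routine stands in. */
    memmove(dst, src, n);
#endif
}
```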

Align Source and Destination with Operand Size

For REP MOVS, make sure that both the source and destination are aligned with regard to the operand size. Handle the end case separately, if necessary. If the source and destination cannot both be aligned, align the destination and leave the source misaligned. For REP STOS, make the destination aligned.
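One way to apply this to REP STOS is sketched below: leading bytes are stored individually until the destination reaches an 8-byte boundary, REP STOSQ fills the aligned middle, and the end case is handled with byte stores. The name fill_bytes is hypothetical, the inline assembly assumes GCC or Clang on a 64-bit x86 target, and other targets use memset for the middle part.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the guide): fill n bytes with value,
 * keeping the destination aligned for the quadword stores. */
static void fill_bytes(unsigned char *dst, unsigned char value, size_t n)
{
    assert(n == 0 || dst != NULL);

    /* Head: byte stores until dst is aligned to the 8-byte operand size. */
    while (n > 0 && ((uintptr_t)dst & 7u) != 0) {
        *dst++ = value;
        n--;
    }

    size_t qwords = n / 8;
    size_t tail = n % 8;
#if defined(__x86_64__) && defined(__GNUC__)
    /* Replicate the byte into all eight lanes of RAX. */
    uint64_t pattern = 0x0101010101010101ULL * value;
    __asm__ volatile("cld\n\trep stosq"
                     : "+D"(dst), "+c"(qwords)
                     : "a"(pattern)
                     : "memory", "cc");
#else
    /* Assumption: on other targets the library routine stands in. */
    memset(dst, value, qwords * 8);
    dst += qwords * 8;
#endif

    /* End case: remaining 0..7 bytes, handled separately. */
    for (size_t i = 0; i < tail; i++)
        dst[i] = value;
}
```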

Inline REP String with Low Counts

For repeat counts of less than 4k, expand REP string instructions into equivalent sequences of simple AMD64 instructions. Use an inline sequence of loads and stores to accomplish the move. Use a sequence of stores to emulate REP STOS. This technique eliminates the setup overhead of REP instructions and increases instruction throughput.
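For a small, fixed size, the expansion might look like the C sketch below, which replaces a 16-byte REP MOVS with two loads and two stores, and a 16-byte REP STOS with two stores. The names copy16 and zero16 are hypothetical. The fixed-size memcpy calls are a portable way to express unaligned 8-byte accesses; optimizing compilers typically lower them to plain MOV instructions, but that is an assumption about the compiler and not something this guide states.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: inline replacement for a 16-byte REP MOVS. */
static inline void copy16(void *dst, const void *src)
{
    uint64_t lo, hi;

    assert(dst != NULL && src != NULL);
    memcpy(&lo, src, sizeof lo);                            /* load  bytes 0..7  */
    memcpy(&hi, (const unsigned char *)src + 8, sizeof hi); /* load  bytes 8..15 */
    memcpy(dst, &lo, sizeof lo);                            /* store bytes 0..7  */
    memcpy((unsigned char *)dst + 8, &hi, sizeof hi);       /* store bytes 8..15 */
}

/* Hypothetical helper: inline replacement for a 16-byte REP STOS of zero. */
static inline void zero16(void *dst)
{
    const uint64_t zero = 0;

    assert(dst != NULL);
    memcpy(dst, &zero, sizeof zero);                        /* store bytes 0..7  */
    memcpy((unsigned char *)dst + 8, &zero, sizeof zero);   /* store bytes 8..15 */
}
```

Both loads complete before either store, so copy16 also gives the expected result when the two 16-byte regions overlap.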

Use Loop for REP String with Low Variable Counts

If the repeat count is variable, but is likely less than eight, use a simple loop to move/store the data. This technique avoids the overhead of REP MOVS and REP STOS.
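A minimal sketch of this dispatch in C follows. The name copy_qwords and the constant SMALL_COUNT_LIMIT are hypothetical; the threshold of eight is taken from the text above. For larger counts the sketch simply calls memcpy, which stands in here for the REP MOVS path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: threshold taken from "likely less than eight" in the text. */
enum { SMALL_COUNT_LIMIT = 8 };

/* Hypothetical helper (not from the guide): copy count quadwords
 * between non-overlapping buffers. */
static void copy_qwords(uint64_t *dst, const uint64_t *src, size_t count)
{
    assert(count == 0 || (dst != NULL && src != NULL));

    if (count < SMALL_COUNT_LIMIT) {
        /* Low variable count: a simple load/store loop avoids the
         * setup overhead of a REP string instruction. */
        for (size_t i = 0; i < count; i++)
            dst[i] = src[i];
    } else {
        /* Larger count: hand off to the bulk-copy path. */
        memcpy(dst, src, count * sizeof *dst);
    }
}
```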

168

Integer Optimizations

Chapter 8
