Software Optimization Guide for AMD64 Processors | 25112 Rev. 3.06 September 2005 |
Use the Largest Possible Operand Size
Always move data using the largest operand size possible. For example, use REP MOVSD rather than
REP MOVSW, and REP MOVSW rather than REP MOVSB. Use REP STOSD rather than REP STOSW, and REP STOSW rather than REP STOSB.
In 64-bit mode, a quadword operand size is also available (for example, REP MOVSQ and REP STOSQ).
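The following is a minimal sketch of the rule above, not code from this guide. It assumes a GCC- or Clang-compatible compiler targeting x86-64 and falls back to memcpy elsewhere. The helper name copy_largest_operand is hypothetical. It moves as much data as possible with REP MOVSQ and handles the remaining 0 to 7 bytes with REP MOVSB.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n bytes using the largest operand size first.
   The regions are assumed not to overlap. */
static void copy_largest_operand(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    size_t qwords = n / 8;  /* repeat count for the quadword move */
    size_t tail   = n % 8;  /* leftover bytes, always fewer than 8 */

#if defined(__x86_64__) && defined(__GNUC__)
    /* DF is assumed to be 0 here; the System V and Windows x64 calling
       conventions both require it on function entry. */
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(qwords)
                     :
                     : "memory");
    /* RDI and RSI now point just past the quadword portion. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(tail)
                     :
                     : "memory");
#else
    /* Portable fallback (assumption: any non-x86-64 or non-GNU toolchain). */
    memcpy(dst, src, n);
    (void)qwords;
    (void)tail;
#endif
}
```

A call such as copy_largest_operand(dst, src, 29) would issue three quadword moves followed by five byte moves.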
Ensure DF = 0 (Increment)
Always make sure that DF = 0 (increment), the state that CLD establishes, for rep movs and rep stos. DF = 1 (decrement) is needed only for certain cases of overlapping rep movs (for example, when the destination begins inside the source region at a higher address).
While string instructions with DF = 1 (decrement) are slower, only the overhead part of the cycle equation is larger and not the throughput part. See Table 6 on page 167 for additional latency numbers.
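As an illustration of when DF = 1 is actually required, the sketch below uses a hypothetical helper, move_bytes, that copies forward with DF = 0 in every case except the one where the destination starts inside the source at a higher address. It assumes a GCC- or Clang-compatible compiler on x86-64 and uses plain loops elsewhere. It is not taken from this guide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: overlap-safe byte move. */
static void move_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (n == 0 || dst == src)
        return;

    uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
    /* A backward copy is needed only when dst lies inside [src, src + n). */
    int backward = (d > s) && (d - s < n);

#if defined(__x86_64__) && defined(__GNUC__)
    if (backward) {
        unsigned char *d_last = dst + n - 1;        /* start at the last byte */
        const unsigned char *s_last = src + n - 1;
        /* STD selects decrement; CLD restores DF = 0 before returning,
           as the calling conventions expect. */
        __asm__ volatile("std\n\trep movsb\n\tcld"
                         : "+D"(d_last), "+S"(s_last), "+c"(n)
                         :
                         : "memory", "cc");
    } else {
        __asm__ volatile("cld\n\trep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory", "cc");
    }
#else
    /* Portable fallback with the same copy direction. */
    if (backward) {
        for (size_t i = n; i-- > 0;)
            dst[i] = src[i];
    } else {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }
#endif
}
```

The decrementing path is kept to the single case that needs it, so all other moves avoid the extra overhead described above.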
Align Source and Destination with Operand Size
For rep movs, make sure that both the source and destination are aligned with regard to the operand size. Handle the end case separately, if necessary. If either source or destination cannot be aligned, make the destination aligned and the source misaligned. For rep stos, make the destination aligned.
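One way to apply the alignment rule is sketched below. The helper name copy_dst_aligned is hypothetical, and the code assumes non-overlapping regions and a GCC- or Clang-compatible compiler on x86-64, with a memcpy fallback elsewhere. It copies single bytes until the destination is 8-byte aligned, moves the bulk as quadwords, and then handles the end case separately. The source may remain misaligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: align the destination, then move quadwords. */
static void copy_dst_aligned(unsigned char *dst, const unsigned char *src,
                             size_t n)
{
    assert(dst != NULL && src != NULL);

    /* Bytes needed to bring dst up to the next 8-byte boundary. */
    size_t head = (size_t)(-(uintptr_t)dst & 7u);
    if (head > n)
        head = n;
    for (size_t i = 0; i < head; i++)
        *dst++ = *src++;
    n -= head;

    size_t qwords = n / 8;  /* aligned bulk portion */
    size_t tail   = n % 8;  /* end case, handled separately */

#if defined(__x86_64__) && defined(__GNUC__)
    /* dst is now 8-byte aligned; DF is assumed to be 0 per the ABI. */
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(qwords)
                     :
                     : "memory");
#else
    /* Portable fallback: advance the pointers by hand. */
    memcpy(dst, src, qwords * 8);
    dst += qwords * 8;
    src += qwords * 8;
#endif

    for (size_t i = 0; i < tail; i++)
        *dst++ = *src++;
}
```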
Inline REP String with Low Counts
If the repeat count is constant and less than eight, expand REP string instructions into equivalent sequences of simple AMD64 instructions. Use an inline sequence of loads and stores to accomplish the move. Use a sequence of stores to emulate REP STOS. This technique eliminates the setup overhead of REP instructions and increases instruction throughput.
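The idea can be sketched in C for a fixed count of four quadwords. The helper names copy4_qwords and fill4_qwords are hypothetical, and the instructions a compiler emits for them may differ from a hand-written sequence. Each function replaces a REP string instruction with an explicit run of loads and stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: stands in for REP MOVSQ with a constant count of 4.
   All loads are issued before the stores. */
static inline void copy4_qwords(uint64_t *dst, const uint64_t *src)
{
    assert(dst != NULL && src != NULL);
    uint64_t a = src[0];
    uint64_t b = src[1];
    uint64_t c = src[2];
    uint64_t d = src[3];
    dst[0] = a;
    dst[1] = b;
    dst[2] = c;
    dst[3] = d;
}

/* Hypothetical helper: stands in for REP STOSQ with a constant count of 4. */
static inline void fill4_qwords(uint64_t *dst, uint64_t value)
{
    assert(dst != NULL);
    dst[0] = value;
    dst[1] = value;
    dst[2] = value;
    dst[3] = value;
}
```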
Use Loop for REP String with Low Variable Counts
If the repeat count is variable, but is likely less than eight, use a simple loop to move/store the data. This technique avoids the overhead of REP MOVS and REP STOS.
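A minimal sketch of such loops follows. The helper names copy_low_count and fill_low_count are hypothetical, and an optimizing compiler may rewrite the loops, for example by vectorizing them or substituting a library call, so the generated code should be checked where this matters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: simple loop in place of REP MOVSQ for a small,
   variable count. */
static void copy_low_count(uint64_t *dst, const uint64_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; i++)
        dst[i] = src[i];
}

/* Hypothetical helper: simple loop in place of REP STOSQ for a small,
   variable count. */
static void fill_low_count(uint64_t *dst, uint64_t value, size_t count)
{
    assert(dst != NULL);
    for (size_t i = 0; i < count; i++)
        dst[i] = value;
}
```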
168 | Integer Optimizations | Chapter 8 |