362 SSE and SSE2 Optimizations Appendix E
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
E.6 SSE and SSE2 Copy Loops
Optimization
When copying data of an unknown format using the XMM registers, it is best to use integer (INT) loads
and stores, such as MOVDQA and MOVNTDQ.
Application
This optimization applies to:
32-bit software
64-bit software
Rationale
When using SSE and SSE2 instructions to perform loads and stores, it is best to interleave them in the
following pattern: Load, Store, Load, Store, Load, Store, and so on.
In 32-bit mode, when using MMX instructions to perform loads and stores, arrange them in pairs in
the following pattern: Load, Load, Store, Store, Load, Load, Store, Store, and so on.
Example
The following example illustrates a sequence of 128-bit loads and stores:
movdqa  xmm0, [rdx+r8*8]        ; Load
movntdq [rcx+r8*8], xmm0        ; Store
movdqa  xmm1, [rdx+r8*8+16]     ; Load
movntdq [rcx+r8*8+16], xmm1     ; Store
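For readers working in C rather than assembly, the same Load, Store, Load, Store ordering can be sketched with SSE2 intrinsics. This is a minimal illustration, not code from this guide: the function name copy_stream_sse2 is hypothetical, and the sketch assumes that both buffers are 16-byte aligned and that the byte count is a multiple of 32. The intrinsics _mm_load_si128 and _mm_stream_si128 normally compile to MOVDQA and MOVNTDQ, but the compiler is free to reschedule instructions, so inspect the generated code if the exact interleaving matters.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the guide): copies `bytes` bytes from src
   to dst using integer 128-bit loads and non-temporal stores.
   Assumptions: src and dst are 16-byte aligned, bytes is a multiple of 32. */
static void copy_stream_sse2(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst & 15) == 0);
    assert(((uintptr_t)src & 15) == 0);
    assert(bytes % 32 == 0);

    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t n = bytes / 16;  /* number of 128-bit blocks */

    for (size_t i = 0; i < n; i += 2) {
        __m128i a = _mm_load_si128(s + i);      /* Load  (typically movdqa)  */
        _mm_stream_si128(d + i, a);             /* Store (typically movntdq) */
        __m128i b = _mm_load_si128(s + i + 1);  /* Load  */
        _mm_stream_si128(d + i + 1, b);         /* Store */
    }

    /* Order the non-temporal stores before any later stores. */
    _mm_sfence();
}
```

A caller would pass aligned buffers, for example copy_stream_sse2(dst, src, 64) for a 64-byte block. Non-temporal stores bypass the cache, so this form suits destinations that will not be read again soon.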