Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

E.6 SSE and SSE2 Copy Loops

Optimization

When copying data of an unknown format using the XMM registers, it is best to use integer (INT) loads and stores, for example MOVDQA.

Application

This optimization applies to:

32-bit software

64-bit software

Rationale

When using SSE and SSE2 instructions to perform loads and stores, it is best to interleave them one for one, in the following pattern: Load, Store, Load, Store, Load, Store, and so on.

In 32-bit mode, when using MMX instructions to perform loads and stores, arrange them in pairs, in the following pattern: Load, Load, Store, Store, Load, Load, Store, Store, and so on.

Example

The following example illustrates a sequence of 128-bit loads and stores:

    movdqa  xmm0, [rdx+r8*8]        ; Load
    movntdq [rcx+r8*8], xmm0        ; Store
    movdqa  xmm1, [rdx+r8*8+16]     ; Load
    movntdq [rcx+r8*8+16], xmm1     ; Store
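For readers working in C rather than assembly, the Load, Store, Load, Store sequence above might be sketched with SSE2 intrinsics as follows. This is an illustration under stated assumptions, not code from this guide: the function name copy_interleaved is hypothetical, both pointers are assumed to be 16-byte aligned, and the byte count is assumed to be a multiple of 32. A compiler is free to reschedule intrinsics, so the source order expresses the intended pattern but does not guarantee the emitted instruction order; inspect the generated assembly if the ordering matters.

```c
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics (also pulls in _mm_sfence) */
#define HAVE_SSE2_COPY 1
#else
#define HAVE_SSE2_COPY 0
#endif

/*
 * Hypothetical helper, not part of the guide.
 * Assumptions: dst and src are 16-byte aligned, do not overlap,
 * and bytes is a multiple of 32 (two 128-bit transfers per iteration).
 */
static void copy_interleaved(void *dst, const void *src, size_t bytes)
{
#if HAVE_SSE2_COPY
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    size_t n = bytes / 16;                       /* number of 128-bit blocks */

    for (size_t i = 0; i + 2 <= n; i += 2) {
        __m128i a = _mm_load_si128(s + i);       /* Load  (movdqa)  */
        _mm_stream_si128(d + i, a);              /* Store (movntdq) */
        __m128i b = _mm_load_si128(s + i + 1);   /* Load  (movdqa)  */
        _mm_stream_si128(d + i + 1, b);          /* Store (movntdq) */
    }
    _mm_sfence();  /* order the non-temporal stores before later stores */
#else
    /* Fallback for targets without SSE2, so the sketch still builds. */
    memcpy(dst, src, bytes);
#endif
}
```

A caller would pass two aligned buffers, for example copy_interleaved(dst, src, 4096) for a 4-Kbyte block. The integer intrinsics (_mm_load_si128 and _mm_stream_si128) are used because the data format is unknown, in keeping with the optimization stated at the start of this section.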

SSE and SSE2 Optimizations

Appendix E
