Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

E.6 SSE and SSE2 Copy Loops

Optimization

When copying data of an unknown format using the XMM registers, it is best to use integer (INT) loads and stores, such as the MOVDQA and MOVNTDQ instructions used in the example in this section.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

When using SSE and SSE2 instructions to perform loads and stores, it is best to interleave them in the following pattern: Load, Store, Load, Store, Load, Store, and so on.

In 32-bit mode, when using MMX instructions to perform loads and stores, arrange them in the following pattern instead: Load, Load, Store, Store, Load, Load, Store, Store, and so on.

Example

The following example illustrates a sequence of 128-bit loads and stores:

movdqa   xmm0, [rdx+r8*8]        ; Load
movntdq  [rcx+r8*8], xmm0        ; Store
movdqa   xmm1, [rdx+r8*8+16]     ; Load
movntdq  [rcx+r8*8+16], xmm1     ; Store
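
The Load, Store, Load, Store pattern above can also be sketched as a complete copy loop in C with SSE2 intrinsics. This is only an illustration, not code from the guide: the function name copy_interleaved_sse2 is hypothetical, and the sketch assumes both buffers are 16-byte aligned and that the byte count is a multiple of 32. _mm_load_si128 and _mm_stream_si128 normally compile to MOVDQA and MOVNTDQ, but a compiler is free to reschedule the instructions, so the generated assembly should be inspected if the exact ordering matters.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in _mm_sfence */
#include <stddef.h>

/* Hypothetical helper, not part of the guide.
 * Assumptions: dst and src are 16-byte aligned, do not overlap,
 * and bytes is a multiple of 32 (two 128-bit blocks per iteration). */
void copy_interleaved_sse2(void *dst, const void *src, size_t bytes)
{
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    size_t blocks = bytes / 16;  /* number of 128-bit blocks to copy */

    for (size_t i = 0; i < blocks; i += 2) {
        __m128i a = _mm_load_si128(&s[i]);      /* Load  (integer, aligned) */
        _mm_stream_si128(&d[i], a);             /* Store (non-temporal)     */
        __m128i b = _mm_load_si128(&s[i + 1]);  /* Load                     */
        _mm_stream_si128(&d[i + 1], b);         /* Store                    */
    }

    /* Order the non-temporal stores before any later stores. */
    _mm_sfence();
}
```

The integer load and store intrinsics are used because the data format is unknown, matching the recommendation in this section. A caller with unaligned or odd-sized buffers would need a separate path for the head and tail bytes, which this sketch leaves out.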

SSE and SSE2 Optimizations

Appendix E