Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

5.16 Interleave Loads and Stores

When loading and storing data, as in a copy routine, the order in which the loads and stores are issued can affect performance.

Application

This optimization applies to:

32-bit software

64-bit software

Rationale

When using SSE and SSE2 instructions to perform loads and stores, it is best to interleave them in the following pattern: Load, Store, Load, Store, Load, Store, and so on. This enables the processor to maximize the load/store bandwidth.

If using MMX loads and stores in 32-bit mode, arrange them in pairs instead, in the following pattern: Load, Load, Store, Store, Load, Load, Store, Store, and so on.
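The paired MMX ordering can be sketched in C with MMX intrinsics. This is a minimal illustration, not code from this guide: the function name copy_paired_mmx and its signature are hypothetical, and the element count is assumed to be even. A compiler is free to reschedule these operations, and outside 32-bit mode it may not emit MMX instructions at all, so inspect the generated assembly before relying on the ordering.

```c
#include <assert.h>
#include <mmintrin.h>   /* MMX: __m64, _mm_empty */
#include <stddef.h>

/* Hypothetical helper (illustrative name and signature).
   Copies `count` 64-bit elements; `count` is assumed to be even. */
static void copy_paired_mmx(__m64 *dst, const __m64 *src, size_t count)
{
    assert(count % 2 == 0);

    for (size_t i = 0; i < count; i += 2) {
        __m64 a = src[i];        /* Load  */
        __m64 b = src[i + 1];    /* Load  */
        dst[i]     = a;          /* Store */
        dst[i + 1] = b;          /* Store */
    }

    _mm_empty();  /* emms: leave MMX state so later x87 code works */
}
```

A call such as copy_paired_mmx(dst, src, 4) copies 32 bytes as two Load, Load, Store, Store groups.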

Example

The following example illustrates a sequence of 128-bit loads and stores:

    movdqa   xmm0, [rdx+r8*8]        ; Load
    movntdq  [rcx+r8*8], xmm0        ; Store
    movdqa   xmm1, [rdx+r8*8+16]     ; Load
    movntdq  [rcx+r8*8+16], xmm1     ; Store
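The same Load, Store, Load, Store ordering can be sketched as a complete copy loop in C with SSE2 intrinsics. This is a minimal illustration under stated assumptions, not code from this guide: the function name copy_interleaved_sse2 is hypothetical, both pointers are assumed to be 16-byte aligned, and the byte count is assumed to be a multiple of 32. The trailing _mm_sfence is an addition that does not appear in the listing above; it orders the weakly ordered non-temporal stores before the function returns. A compiler may reschedule intrinsics, so check the generated assembly if the exact ordering matters.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustrative name and signature).
   Assumes dst and src are 16-byte aligned and bytes is a multiple of 32. */
static void copy_interleaved_sse2(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst & 15) == 0 && ((uintptr_t)src & 15) == 0);
    assert(bytes % 32 == 0);

    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < bytes / 16; i += 2) {
        __m128i a = _mm_load_si128(s + i);      /* Load  (movdqa)  */
        _mm_stream_si128(d + i, a);             /* Store (movntdq) */
        __m128i b = _mm_load_si128(s + i + 1);  /* Load  (movdqa)  */
        _mm_stream_si128(d + i + 1, b);         /* Store (movntdq) */
    }

    _mm_sfence();  /* order the non-temporal stores before returning */
}
```

Each loop iteration moves 32 bytes, matching the two load/store pairs in the assembly listing.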


Cache and Memory Optimizations

Chapter 5
