25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

requests are “in flight” to keep the memory system busy all of the time. The AMD Athlon 64 and AMD Opteron processors support a maximum of eight concurrent memory requests to different cache lines. Multiple requests to the same cache line count as only one towards this limit of eight.
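As a rough illustration of why the number of requests in flight matters, the limit of eight outstanding cache lines can be turned into an upper bound on sustained bandwidth using Little's law (bandwidth is at most bytes in flight divided by latency). The following C sketch is not from this guide; the function name and the latency value passed to it are hypothetical, and real latency must be measured on the target system.

```c
#include <assert.h>

/* Limits quoted in the text above. */
#define MAX_OUTSTANDING_LINES 8   /* concurrent requests to distinct cache lines */
#define CACHE_LINE_BYTES      64  /* bytes per cache line */

/*
 * Upper bound on sustained read bandwidth in bytes per nanosecond
 * (numerically equal to GB/s), from Little's law:
 *     bandwidth <= lines_in_flight * line_size / latency
 * latency_ns is a hypothetical, caller-supplied memory latency.
 */
double peak_bandwidth_bytes_per_ns(int lines_in_flight, double latency_ns)
{
    assert(lines_in_flight >= 0 && latency_ns > 0.0);
    if (lines_in_flight > MAX_OUTSTANDING_LINES)
        lines_in_flight = MAX_OUTSTANDING_LINES;  /* cap at the stated limit */
    return (double)lines_in_flight * CACHE_LINE_BYTES / latency_ns;
}
```

With an assumed latency of 128 ns, eight lines in flight bound the rate at 4 bytes/ns, while a single line in flight bounds it at 0.5 bytes/ns, which is why code that keeps only one request outstanding cannot keep the memory system busy.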

[Diagram: a timeline showing memory cycles M1 through M5, each lasting one memory burst time (one 64-byte cache line), overlapped with CPU loops C0 through C4 across the total memory latency. The loop issues prefetchnta [esi + 64*4], a prefetch distance of about 4 cache lines ahead.]

Figure 4. Memory-Limited Code
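The pattern in Figure 4 can be sketched in C as a loop that touches one cache line per iteration and requests the line four cache lines ahead. This is a minimal sketch, not code from this guide: it assumes a GCC- or Clang-compatible compiler, uses the compiler builtin `__builtin_prefetch` with locality 0 (a non-temporal hint, which these compilers typically lower to prefetchnta on x86 targets), and the function name and `PREFETCH_LINES` constant are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64
#define PREFETCH_LINES   4   /* about 4 cache lines ahead, as in Figure 4 */

/* Sum a byte buffer one cache line at a time, prefetching ahead. */
uint64_t sum_with_prefetch(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    const size_t ahead = (size_t)CACHE_LINE_BYTES * PREFETCH_LINES;
    uint64_t sum = 0;

    for (size_t base = 0; base < len; base += CACHE_LINE_BYTES) {
        /* Request the line PREFETCH_LINES ahead, if it lies inside the
           buffer. Arguments: address, 0 = read, 0 = no temporal locality. */
        if (base + ahead < len)
            __builtin_prefetch(buf + base + ahead, 0, 0);

        /* Light work on the current line: the loop is memory-limited. */
        size_t end = base + CACHE_LINE_BYTES;
        if (end > len)
            end = len;
        for (size_t i = base; i < end; i++)
            sum += buf[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same with or without it; whether a distance of four lines is appropriate depends on the memory latency and loop cost of the actual system.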

Code that performs many computations on each cache line is limited by processor speed rather than memory bandwidth, as shown in Figure 5. In this case, the goal of software prefetching is just to ensure that the memory data is available when the processor needs it. As the processor speed increases, the optimal prefetch distance increases until the memory bandwidth becomes the limiting factor.
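The relationship between processor speed and prefetch distance can be approximated with a simple model: the loop consumes one cache line every max(CPU time per line, memory burst time), and a prefetch must be issued at least one total memory latency before the data is used. The C sketch below is an illustration under that simplified model, not a formula given in this guide; the function name and all timing values are hypothetical, and measured tuning on the target system should take precedence.

```c
#include <assert.h>

/*
 * Estimate how many cache lines ahead to prefetch.
 * The loop advances one line every `period` ns, where period is the
 * larger of the CPU time per line and the memory burst time per line.
 * The distance is the total latency divided by that period, rounded up.
 */
int prefetch_distance_lines(double latency_ns,
                            double cpu_ns_per_line,
                            double burst_ns_per_line)
{
    assert(latency_ns > 0.0 && cpu_ns_per_line > 0.0 &&
           burst_ns_per_line > 0.0);

    double period = cpu_ns_per_line > burst_ns_per_line
                        ? cpu_ns_per_line
                        : burst_ns_per_line;

    int lines = (int)(latency_ns / period);   /* truncates toward zero */
    if ((double)lines * period < latency_ns)
        lines++;                              /* round up to cover latency */
    return lines;
}
```

With assumed values of 120 ns latency and a 30 ns burst time, a loop that needs 60 ns of CPU time per line yields a distance of 2 lines, and speeding the loop up to 10 ns per line raises the distance to 4 lines, where it stays because the burst time is now the limit.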

For an example of how to use software prefetching in processor-limited code, see “Structuring Code with Prefetch Instructions to Hide Memory Latency” on page 200.

[Diagram: a timeline showing memory cycles M1 through M5, each lasting one memory burst time (one 64-byte cache line), against CPU loops C1 through C5, each taking the CPU time needed to process one cache line, across the total memory latency. The loop issues prefetchnta [esi + 64*2], a prefetch distance of about 2 cache lines ahead (maybe use 3 for safety).]

Figure 5. Processor-Limited Code

Chapter 5

Cache and Memory Optimizations

109
