Chapter 5 Cache and Memory Optimizations 109

Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005

requests are “in flight” to keep the memory system busy all of the time. The AMD Athlon 64 and AMD Opteron processors support a maximum of eight concurrent memory requests to different cache lines. Multiple requests to the same cache line count as only one toward this limit of eight.
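The counting rule above can be illustrated with a small sketch. It assumes the 64-byte cache-line size used in this chapter's figures; the helper name distinct_cache_lines and the constants are hypothetical and are not part of any AMD interface. The helper only counts how many distinct cache lines a set of addresses touches, which is the quantity that counts toward the limit of eight.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* assumed line size, as in this chapter's figures */
#define MAX_IN_FLIGHT    8u   /* limit stated in the text for these processors */

/* Hypothetical helper: count the distinct 64-byte cache lines touched by
   a set of addresses. Addresses that fall in the same line are counted once.
   Uses a simple quadratic scan, which is adequate for a handful of addresses. */
static size_t distinct_cache_lines(const uintptr_t *addrs, size_t n)
{
    assert(addrs != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        uintptr_t line = addrs[i] / CACHE_LINE_BYTES;
        int seen = 0;
        for (size_t j = 0; j < i; j++) {
            if (addrs[j] / CACHE_LINE_BYTES == line) {
                seen = 1;
                break;
            }
        }
        if (!seen) {
            count++;
        }
    }
    return count;
}
```

For example, the addresses 0x1000, 0x1008, 0x103F, and 0x1040 touch only two cache lines, so under the rule above they would count as two requests toward MAX_IN_FLIGHT, not four.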

Figure 4. Memory-Limited Code

Code that performs many computations on each cache line is limited by processor speed rather than memory bandwidth, as shown in Figure 5. In this case, the goal of software prefetching is just to ensure that the memory data is available when the processor needs it. As the processor speed increases, the optimal prefetch distance increases until the memory bandwidth becomes the limiting factor.
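As a rough illustration of how the prefetch distance relates to these quantities, the following sketch divides the total memory latency by the time spent per cache line and rounds up. This is a simplified model inferred from Figures 4 and 5, not a formula given in this guide; the function name prefetch_distance_lines and its parameters are hypothetical, and all three inputs must use the same unit (for example, nanoseconds or clock cycles).

```c
#include <assert.h>

/* Hypothetical helper: estimate how many cache lines ahead to prefetch.
   latency        - total memory latency for one request
   cpu_per_line   - CPU time to process one cache line
   burst_per_line - memory burst time for one cache line
   The slower of the CPU and the memory system sets the pace per line, so
   the latency is divided by the larger of the two and rounded up. */
static unsigned prefetch_distance_lines(unsigned latency,
                                        unsigned cpu_per_line,
                                        unsigned burst_per_line)
{
    unsigned per_line = cpu_per_line > burst_per_line ? cpu_per_line
                                                      : burst_per_line;
    assert(per_line > 0);
    return (latency + per_line - 1) / per_line;  /* ceiling division */
}
```

With purely illustrative numbers (not measurements), a latency of 200 and a burst time of 50 give a distance of 4 lines when the CPU needs only 20 per line (memory-limited), and a distance of 2 lines when the CPU needs 100 per line (processor-limited). As cpu_per_line shrinks, the estimate grows until burst_per_line becomes the larger term, which matches the behavior described above. In practice one might add a line of margin to the processor-limited estimate.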

For an example of how to use software prefetching in processor-limited code, see “Structuring Code with Prefetch Instructions to Hide Memory Latency” on page 200.
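Independently of that example, the general shape of such a loop can be sketched in C. This is a minimal sketch, not the code from page 200. It assumes a GCC- or Clang-compatible compiler, whose __builtin_prefetch extension with a locality argument of 0 is typically lowered to a non-temporal prefetch on x86; the function name sum_with_prefetch, the distance of two cache lines, and the summation that stands in for the per-line computation are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64                     /* assumed line size */
#define FLOATS_PER_LINE  (CACHE_LINE_BYTES / sizeof(float))
#define PREFETCH_LINES   2                      /* assumed prefetch distance */

/* Sum an array one cache line at a time, requesting the line that will be
   needed PREFETCH_LINES iterations from now before working on the current one. */
static float sum_with_prefetch(const float *data, size_t n)
{
    assert(data != NULL || n == 0);
    const size_t ahead = PREFETCH_LINES * FLOATS_PER_LINE;
    float sum = 0.0f;
    size_t i = 0;

    for (; i + FLOATS_PER_LINE <= n; i += FLOATS_PER_LINE) {
        /* Only prefetch addresses that lie inside the array. */
        if (i + ahead < n) {
            /* rw = 0 (read), locality = 0 (no temporal locality). */
            __builtin_prefetch(&data[i + ahead], 0, 0);
        }
        /* Stand-in for the computation performed on one cache line. */
        for (size_t k = 0; k < FLOATS_PER_LINE; k++) {
            sum += data[i + k];
        }
    }
    /* Handle the elements that do not fill a whole cache line. */
    for (; i < n; i++) {
        sum += data[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same with or without it; only the timing changes. The bounds check keeps the sketch free of out-of-range pointer arithmetic at the cost of one extra branch per cache line.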

Figure 5. Processor-Limited Code

[Figure 4 diagram: memory cycles M1 to M5 and CPU loops C0 to C4 plotted against time. Labels: total memory latency; memory burst time (one 64-byte cache line); prefetchnta [esi + 64*4]; prefetch distance is ~4 cache lines ahead.]

[Figure 5 diagram: CPU loops C1 to C5 and memory cycles M1 to M5 plotted against time. Labels: total memory latency; memory burst time (one 64-byte cache line); CPU time (process one cache line); prefetchnta [esi + 64*2]; prefetch distance is ~2 cache lines ahead (maybe use 3 for safety).]