
25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

instructions can improve performance. Prefetch instructions only update the L1 data cache and do not update an architectural register, so a prefetch uses one fewer register than a load instruction.

Unit-Stride Access

Large data sets typically require unit-stride access to ensure that all data pulled in by a prefetch instruction is actually used. With unit-stride access, the code uses all of the data that is read from memory, rather than only a sparse subset of it. If necessary, reorganize algorithms or data structures to allow unit-stride access. For a definition of unit-stride access, see “Definitions” on page 110.
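The following is a minimal sketch of such a reorganization, not code from this guide. The struct particle layout and both function names are hypothetical, and the sketch assumes an 8-byte double, which makes each record exactly 64 bytes.

```c
#include <stddef.h>

/* Hypothetical record layout, used only for illustration. */
struct particle {
    double x;        /* the only field the loops below need */
    double other[7]; /* 56 bytes that the loops never touch */
};

/* Non-unit stride: consecutive x values are sizeof(struct particle)
   bytes apart (64 with 8-byte doubles), so each cache line that is
   fetched contributes only 8 useful bytes. */
double sum_x_strided(const struct particle *p, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += p[i].x;
    return sum;
}

/* Unit stride: the x values are stored contiguously in their own
   array, so every byte of each fetched cache line is used. */
double sum_x_unit_stride(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}
```

Both functions compute the same result. The second reads one eighth as many cache lines for the same number of x values, under the layout assumed above.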

Hardware Prefetching

The AMD Athlon 64 and AMD Opteron processors implement a hardware prefetching mechanism. The prefetched data is loaded into the L2 cache. The hardware prefetcher works most efficiently when data is accessed on a cache-line-by-cache-line basis (that is, without skipping cache lines). Cache lines on current AMD Athlon 64 and AMD Opteron processors are 64 bytes, but cache-line size is implementation dependent.

The hardware prefetcher prefetches data that is accessed in an ascending or descending order on a cache-line-by-cache-line basis. For example, when the hardware prefetcher detects an access to cache line l followed by an access to cache line l + 1, it initiates a prefetch of cache line l + 2. Accessing data in increments larger than 64 bytes may fail to trigger the hardware prefetcher because cache lines are skipped. In these cases, software-prefetch instructions should be employed. Note that in some earlier revisions of the AMD Athlon 64 and AMD Opteron processors the hardware prefetcher would only detect ascending accesses.
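As a rough illustration of the strided case, the sketch below issues a software prefetch a fixed number of iterations ahead of the current access. It is an assumption-laden example, not code from this guide: sum_strided and PREFETCH_DISTANCE are hypothetical names, the distance of 8 is a placeholder to be tuned by measurement, and __builtin_prefetch is a GCC/Clang builtin rather than an AMD-defined interface.

```c
#include <stddef.h>

/* Assumed look-ahead, in loop iterations. A placeholder value;
   a suitable distance has to be found by measurement. */
#define PREFETCH_DISTANCE 8

/* Sums every stride-th element of a[0..n-1]. When
   stride * sizeof(float) exceeds 64 bytes, cache lines are skipped,
   so the loop requests the data it will need explicitly. */
float sum_strided(const float *a, size_t n, size_t stride)
{
    float sum = 0.0f;
    if (stride == 0)
        return sum;               /* avoid an endless loop */
    for (size_t i = 0; i < n; i += stride) {
        size_t ahead = i + PREFETCH_DISTANCE * stride;
        if (ahead < n)
            /* GCC/Clang builtin: read access (0), high temporal
               locality (3). */
            __builtin_prefetch(&a[ahead], 0, 3);
        sum += a[i];
    }
    return sum;
}
```

With 4-byte floats, a stride of 32 elements touches one element every 128 bytes, which is the kind of pattern the hardware prefetcher may not detect.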

In some cases, using prefetch instructions on processors with hardware prefetching may slightly reduce performance. In these cases, it may be necessary to remove the prefetch instructions. All current AMD Athlon 64 and AMD Opteron processors have hardware prefetching mechanisms.

PREFETCH/W versus PREFETCHNTA/T0/T1/T2

PREFETCHNTA, PREFETCHT0, PREFETCHT1, and PREFETCHT2 are SSE instructions and are processor-implementation dependent. For the AMD Athlon 64 and AMD Opteron processors, data that is prefetched with the PREFETCHNTA instruction is not placed into the L2 cache when it is evicted unless it was originally in L2 when prefetched.

PREFETCHNTA is intended for non-temporal data that will not be needed again soon. PREFETCHNTA should also be used when reading arrays that are larger than the L2 cache. Such arrays will not remain in L2 even if they are needed again, and feeding them through the L2 cache evicts other, possibly useful data from L2.
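A single pass over a large array might look like the sketch below. It is illustrative only: sum_stream, PREFETCH_NTA, and LINES_AHEAD are hypothetical names, the look-ahead of 4 cache lines is an untuned assumption, and the 64-byte line size is the value stated for current processors rather than a guaranteed constant. The _mm_prefetch intrinsic with _MM_HINT_NTA comes from the compiler's xmmintrin.h header. On targets without SSE the macro expands to nothing.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
/* Requests a non-temporal prefetch of the line containing p. */
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)0)  /* no-op fallback for other targets */
#endif

#define CACHE_LINE_BYTES 64  /* assumed; implementation dependent */
#define LINES_AHEAD 4        /* assumed look-ahead; tune by measurement */

/* Reads a[0..n-1] once, issuing one non-temporal prefetch per cache
   line, a few lines ahead of the element being summed. */
long long sum_stream(const int *a, size_t n)
{
    const size_t per_line = CACHE_LINE_BYTES / sizeof(int);
    long long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i % per_line == 0 && i + LINES_AHEAD * per_line < n)
            PREFETCH_NTA(&a[i + LINES_AHEAD * per_line]);
        sum += a[i];
    }
    return sum;
}
```

Because the access pattern here is unit-stride, the hardware prefetcher may already cover it, so whether the explicit prefetch helps should be checked by measurement.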

Note: The L2 cache size of the processor can be determined by using the CPUID instruction.
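One possible way to read the L2 size is sketched below. It assumes CPUID function 8000_0006h, which on AMD processors reports the L2 size in KB in ECX[31:16] and the L2 line size in bytes in ECX[7:0]. The struct l2_info type and the functions decode_l2_ecx and query_l2 are hypothetical names. The __get_cpuid helper is provided by the cpuid.h header of GCC and Clang, so the query is compiled only for x86 targets with those compilers.

```c
#include <stdint.h>

/* Hypothetical container for the two fields decoded below. */
struct l2_info {
    unsigned size_kb;    /* L2 cache size in KB */
    unsigned line_bytes; /* L2 cache-line size in bytes */
};

/* Decodes the ECX value returned by CPUID function 8000_0006h. */
struct l2_info decode_l2_ecx(uint32_t ecx)
{
    struct l2_info info;
    info.size_kb = ecx >> 16;       /* ECX[31:16] */
    info.line_bytes = ecx & 0xFFu;  /* ECX[7:0] */
    return info;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
/* Returns 1 and fills *info on success, or 0 if the processor does
   not support extended function 8000_0006h. */
int query_l2(struct l2_info *info)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000006u, &eax, &ebx, &ecx, &edx))
        return 0;
    *info = decode_l2_ecx(ecx);
    return 1;
}
#endif
```

A program could call query_l2 once at startup and compare the array size against size_kb * 1024 when deciding whether to use PREFETCHNTA.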

Chapters 5 and 9 show examples of how to use the PREFETCHNTA instruction.

Chapter 5

Cache and Memory Optimizations
