Prefetch Instructions, Prefetching versus Preloading, 104

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

5.6 Prefetch Instructions

Optimization

Where appropriate, use one of the prefetch instructions to increase the effective bandwidth of the AMD Athlon 64 and AMD Opteron processors.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Prefetch instructions take advantage of the high bus bandwidth of the AMD Athlon 64 and AMD Opteron processors to hide latencies when fetching data from system memory. A prefetch instruction initiates a read request of a specified address and reads the entire cache line that contains that address.
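The effect described above can be sketched in C. This is a minimal illustration, not code from the guide: it assumes a GCC- or Clang-compatible compiler (for `__builtin_prefetch`), a 64-byte cache line, and a prefetch distance of four lines chosen arbitrarily. The helper name `sum_with_prefetch` and both constants are hypothetical, and a real prefetch distance would have to be tuned by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only: a 64-byte cache line and a
   prefetch distance of 4 lines. Neither number comes from this section. */
#define CACHE_LINE_BYTES 64
#define PREFETCH_LINES_AHEAD 4

/* Hypothetical helper: sums n doubles while requesting data a few cache
   lines ahead of the element currently being read. */
double sum_with_prefetch(const double *data, size_t n)
{
    const size_t per_line = CACHE_LINE_BYTES / sizeof(double);
    const size_t ahead = PREFETCH_LINES_AHEAD * per_line;
    double total = 0.0;

    assert(data != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        /* Roughly one prefetch per cache line is enough, because a
           prefetch reads the entire line that contains the address.
           The bounds check keeps the requested address inside the array. */
        if (i % per_line == 0 && i + ahead < n)
            __builtin_prefetch(&data[i + ahead], 0, 3); /* read, keep in cache */
        total += data[i];
    }
    return total;
}
```

Calling `sum_with_prefetch(a, 100)` on an array holding 0 through 99 returns the same total as a plain loop; the prefetch is only a hint and never changes the result.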

AMD Athlon 64 and AMD Opteron processors perform three types of prefetches:

Prefetch type	Description

Load	Reads the data into the L1 data cache; the data is later evicted to the L2 cache. The following instructions perform load prefetches: PREFETCH, PREFETCHT0, PREFETCHT1, and PREFETCHT2.

Store	Reads the data into the L1 data cache and marks the data as modified; the data is later evicted to the L2 cache. The PREFETCHW instruction performs a store prefetch.

Nontemporal	The PREFETCHNTA instruction performs a nontemporal prefetch. The data is read into the L1 data cache; to avoid cache pollution, when a PREFETCHNTA misses in the L2 cache and reads from memory, the data is never evicted to the L2 cache. When a PREFETCHNTA hits in the L2 cache, the data is evicted back to the L2 cache. AMD Athlon 64 and AMD Opteron processors prior to Revision E read data into one way of the L1 cache when the PREFETCHNTA instruction was used. Revision E processors read PREFETCHNTA data into both ways of the L1 cache.
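As a rough sketch, the three prefetch types can be requested from C through the GCC/Clang builtin `__builtin_prefetch(addr, rw, locality)`. The wrapper names, the `scale_copy` routine, and the distance of 64 elements are hypothetical. The instruction mapping in the comments is what these compilers typically emit for x86 targets, not a guarantee: in particular, a write hint may become PREFETCHW only when the target options enable that instruction, and may otherwise fall back to a load prefetch.

```c
#include <assert.h>
#include <stddef.h>

/* Load prefetch: data that will be read and reused (typically PREFETCHT0). */
void prefetch_load(const void *p)        { __builtin_prefetch(p, 0, 3); }

/* Store prefetch: data that will be modified (PREFETCHW where the target
   options allow it; this mapping is an assumption about the compiler). */
void prefetch_store(const void *p)       { __builtin_prefetch(p, 1, 3); }

/* Nontemporal prefetch: data read once and not reused (typically
   PREFETCHNTA). */
void prefetch_nontemporal(const void *p) { __builtin_prefetch(p, 0, 0); }

/* Hypothetical routine: dst[i] = src[i] * k. The source is streamed once,
   so it gets a nontemporal hint; the destination is written, so it gets
   a store hint. */
void scale_copy(float *dst, const float *src, size_t n, float k)
{
    enum { AHEAD = 64 };   /* prefetch distance in elements, assumed */
    enum { PER_LINE = 16 };/* floats per line, assuming 64-byte lines */

    assert((dst != NULL && src != NULL) || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (i % PER_LINE == 0 && i + AHEAD < n) {
            prefetch_nontemporal(&src[i + AHEAD]);
            prefetch_store(&dst[i + AHEAD]);
        }
        dst[i] = src[i] * k;
    }
}
```

Which hint is appropriate depends on whether the data will be reused or modified, as the table describes; `prefetch_load` is the usual choice for data that is read and then read again.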

The prefetch instructions can be used anywhere, in any type of code. The use of prefetch instructions is not affected by the values of Control Register 0 (CR0) bits, such as CR0.EM and CR0.TS.

Prefetching versus Preloading

In code that makes irregular memory accesses rather than sequential accesses, an ordinary MOV instruction is the best way to load data. But in situations where sequential addresses are read, prefetch

Cache and Memory Optimizations

Chapter 5
