Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

5.6Prefetch Instructions

Optimization

Where appropriate, use one of the prefetch instructions to increase the effective bandwidth of the AMD Athlon 64 and AMD Opteron processors.

Application

This optimization applies to:

32-bit software

64-bit software

Rationale

Prefetch instructions take advantage of the high bus bandwidth of the AMD Athlon 64 and AMD Opteron processors to hide latencies when fetching data from system memory. A prefetch instruction initiates a read request of a specified address and reads the entire cache line that contains that address.

AMD Athlon 64 and AMD Opteron processors perform three types of prefetches:

Prefetch type

Description

 

 

Load

Reads the data into the L1 data cache; the data is later evicted to the L2 cache. The

 

following instructions perform load prefetches: PREFETCH, PREFETCHT0,

 

PREFETCHT1, and PREFETCHT2.

 

 

Store

Reads the data into the L1 data cache and marks the data as modified; the data is

 

later evicted to the L2 cache. The PREFETCHW instruction performs a store prefetch.

 

 

Nontemporal

The PREFETCHNTA instruction performs a nontemporal prefetch. The data is read

 

into the L1 data cache; to avoid cache pollution, when a PREFETCHNTA misses in

 

the L2 cache and reads from memory, the data is never evicted to the L2 cache. When

 

a PREFETCHNTA hits in the L2 cache, the data is evicted back to the L2 cache. AMD

 

Athlon 64 and AMD Opteron processors prior to Revision E read data into one way of

 

the L1 cache when the PREFETCHNTA instruction was used. Revision E processors

 

read PREFETCHNTA data into both ways of the L1 cache.

 

 

The prefetch instructions can be used anywhere, in any type of code. The use of prefetch instructions is not affected by the values of Control Register 0 (CR0) bits, such as CR0.EM and CR0.TS.

Prefetching versus Preloading

In code that makes irregular memory accesses rather than sequential accesses, an ordinary MOV instruction is the best way to load data. But in situations where sequential addresses are read, prefetch

104

Cache and Memory Optimizations

Chapter 5

Page 120
Image 120
AMD 250 manual Prefetch Instructions, Prefetching versus Preloading, 104