106 Cache and Memory Optimizations Chapter 5
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
Note: PREFETCHNTA should NOT be used for large arrays that are only being written, not read.
In such cases, write-combining stores should be used. (See “Write-combining” on page 113,
Appendix B “Implementation of Write-Combining” on page 263, and “Write-Combining” in
Volume 2 of the AMD64 Architecture Programmer’s Manual (order no. 24593).)
Current AMD Athlon 64 and AMD Opteron processors implement the PREFETCHT0,
PREFETCHT1 and PREFETCHT2 instructions in exactly the same way as the PREFETCH
instructions. That is, the data is brought into the L1 data cache. This functionality could be changed in
future implementations.
PREFETCHW versus PREFETCH
Code that intends to modify the cache line that is brought in through prefetching should use the
PREFETCHW instruction. PREFETCHW gives a hint to the AMD Athlon 64 and AMD Opteron
processors of an intent to modify the cache line. The AMD Athlon 64 and AMD Opteron processors
mark the cache line being read by PREFETCHW as modified. Compared to PREFETCH, PREFETCHW can
save cycles because it avoids the subsequent cache state change that a write to the prefetched
cache line would otherwise cause. Use PREFETCHW only if there is a write to the same cache line
afterwards.
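The following C fragment is a minimal sketch of a read-modify-write loop that prefetches with intent to write. It is not taken from this guide. The function name add_in_place, the constants CACHE_LINE_BYTES and PREFETCH_AHEAD_LINES, and the use of the GCC/Clang builtin __builtin_prefetch are assumptions made for illustration. Whether the compiler actually emits PREFETCHW for a write prefetch depends on the compiler and its target options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration: a 64-byte cache line and a
   prefetch distance of 4 lines. The distance is a tuning guess. */
#define CACHE_LINE_BYTES     64
#define PREFETCH_AHEAD_LINES 4

/* Adds delta to every element in place. Every cache line that is
   prefetched is also written, which is the case PREFETCHW targets. */
void add_in_place(uint32_t *data, size_t count, uint32_t delta)
{
    const size_t per_line = CACHE_LINE_BYTES / sizeof(uint32_t);
    const size_t ahead    = PREFETCH_AHEAD_LINES * per_line;

    assert(data != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        /* Roughly once per cache line, hint that a line further ahead
           will be modified. Second argument 1 = prefetch for write;
           third argument 3 = keep in all cache levels. The bounds
           check keeps the address inside the array. */
        if (i % per_line == 0 && i + ahead < count)
            __builtin_prefetch(&data[i + ahead], 1, 3);

        data[i] += delta;
    }
}
```

A loop that only reads data should keep using a plain read prefetch, since a write-intent prefetch is useful only when a write to the same line follows.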
Write-Combining Usage
Use write-combining instructions instead of PREFETCHW in situations where all of the following
conditions are true:
• The code will overwrite one or more complete cache lines with new data.
• The new data will not be used again soon.
Write-combining instructions include the SSE and SSE2 instructions MOVNTDQ, MOVNTI,
MOVNTPS, and MOVNTPD. They also include the MMX instruction MOVNTQ.
Write-combining instructions can dramatically improve memory-write performance. They write data
directly to memory through write-combining buffers, bypassing the cache. This is faster than
PREFETCHW because data does not need to be initially read from memory to fill the cache lines,
only to be completely overwritten shortly thereafter. The new data is simply written to memory,
replacing the old data in memory, so no memory read is performed.
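As a hedged illustration of the two conditions above, the sketch below fills a buffer using the SSE2 intrinsic _mm_stream_si128, which compiles to MOVNTDQ. The function name fill_streaming and the alignment and size preconditions are assumptions chosen so that each loop iteration writes exactly one complete 64-byte cache line. It is one possible arrangement, not the implementation recommended by this guide.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_set1_epi32, _mm_stream_si128 */
#include <stddef.h>
#include <stdint.h>

/* Fills 'bytes' bytes at dst with a repeated 32-bit value using
   non-temporal (write-combining) stores.
   Assumed preconditions: dst is 64-byte aligned and bytes is a
   multiple of 64, so only complete cache lines are written. */
void fill_streaming(void *dst, uint32_t value, size_t bytes)
{
    assert(((uintptr_t)dst % 64) == 0);
    assert(bytes % 64 == 0);

    const __m128i v = _mm_set1_epi32((int)value);
    __m128i *p = (__m128i *)dst;
    const size_t vectors = bytes / sizeof(__m128i);

    /* Four 16-byte MOVNTDQ stores cover one 64-byte cache line.
       No prior read of the destination line is needed. */
    for (size_t i = 0; i < vectors; i += 4) {
        _mm_stream_si128(p + i + 0, v);
        _mm_stream_si128(p + i + 1, v);
        _mm_stream_si128(p + i + 2, v);
        _mm_stream_si128(p + i + 3, v);
    }

    /* Non-temporal stores are weakly ordered; SFENCE makes them
       globally visible before any later stores. */
    _mm_sfence();
}
```

If the filled data is going to be read again shortly afterwards, ordinary cached stores are likely the better choice, because the second condition above no longer holds.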
One application where write-combining is useful, often in conjunction with prefetch instructions, is in
copying large blocks of memory.
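A block copy of this kind might be sketched as follows, combining a non-temporal prefetch of the source (PREFETCHNTA via _mm_prefetch) with write-combining stores to the destination (MOVNTDQ via _mm_stream_si128). The function name copy_streaming, the PREFETCH_DISTANCE_BYTES value, and the alignment preconditions are assumptions for illustration only; a suitable prefetch distance has to be found by measurement.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 loads/stores; also pulls in _mm_prefetch */
#include <stddef.h>
#include <stdint.h>

/* Assumed tuning value: how far ahead of the current read position
   the source is prefetched. */
#define PREFETCH_DISTANCE_BYTES 256

/* Copies 'bytes' bytes from src to dst one 64-byte cache line at a time.
   Assumed preconditions: dst is 64-byte aligned, src is 16-byte aligned,
   bytes is a multiple of 64, and the buffers do not overlap. */
void copy_streaming(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst % 64) == 0);
    assert(((uintptr_t)src % 16) == 0);
    assert(bytes % 64 == 0);

    const char *s = (const char *)src;
    char *d = (char *)dst;

    for (size_t off = 0; off < bytes; off += 64) {
        /* Prefetch source data that will be read a few lines from now.
           The bounds check keeps the address inside the source buffer. */
        if (off + PREFETCH_DISTANCE_BYTES < bytes)
            _mm_prefetch(s + off + PREFETCH_DISTANCE_BYTES, _MM_HINT_NTA);

        /* Read one cache line from the source with ordinary loads. */
        const __m128i a = _mm_load_si128((const __m128i *)(s + off + 0));
        const __m128i b = _mm_load_si128((const __m128i *)(s + off + 16));
        const __m128i c = _mm_load_si128((const __m128i *)(s + off + 32));
        const __m128i e = _mm_load_si128((const __m128i *)(s + off + 48));

        /* Write the complete destination line with streaming stores. */
        _mm_stream_si128((__m128i *)(d + off + 0),  a);
        _mm_stream_si128((__m128i *)(d + off + 16), b);
        _mm_stream_si128((__m128i *)(d + off + 32), c);
        _mm_stream_si128((__m128i *)(d + off + 48), e);
    }

    /* Order the weakly ordered streaming stores before later stores. */
    _mm_sfence();
}
```

The destination is never prefetched in this sketch, since every destination line is overwritten in full and reading it first would only add memory traffic.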
Note: The write-combining instructions are not recommended or necessary for write-combined
memory regions since the processor will automatically combine writes for those regions.
Write-combine memory types are indicated through the MTRRs and the page-attribute table
(PAT).
Note: For best performance, do not mix write-combining instructions on a cache line with
non-write-combining store instructions.