Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

Note: PREFETCHNTA should NOT be used for large arrays that are only being written, not read. In such cases, write-combining stores should be used. (See “Write-combining” on page 113, Appendix B “Implementation of Write-Combining” on page 263, and “Write-Combining” in Volume 2 of the AMD64 Architecture Programmer’s Manual (order no. 24593).)

Current AMD Athlon 64 and AMD Opteron processors implement the PREFETCHT0, PREFETCHT1, and PREFETCHT2 instructions in exactly the same way as the PREFETCH instruction. That is, the data is brought into the L1 data cache. This functionality could be changed in future implementations.

PREFETCHW versus PREFETCH

Code that intends to modify the cache line that is brought in through prefetching should use the PREFETCHW instruction. PREFETCHW gives a hint to the AMD Athlon 64 and AMD Opteron processors of an intent to modify the cache line. The AMD Athlon 64 and AMD Opteron processors mark the cache line being read by PREFETCHW as modified. Using PREFETCHW can save additional cycles compared to PREFETCH, and avoid the subsequent cache state change caused by a write to the prefetched cache line. Only use PREFETCHW if there is a write to the same cache line afterwards.
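As an illustration of the guidance above, the following is a minimal sketch of a read-modify-write loop that issues a write-intent prefetch ahead of the current position. It assumes a GCC- or Clang-compatible compiler, where `__builtin_prefetch(addr, 1, 3)` requests a prefetch with intent to write; whether the compiler actually emits PREFETCHW depends on the target flags, so treat this as a hint only. The function name `add_in_place` and the distance `PREFETCH_AHEAD` are hypothetical and would need tuning on real hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning value: how many elements ahead to prefetch.
   64 elements * 4 bytes = 256 bytes, i.e. four 64-byte cache lines. */
#define PREFETCH_AHEAD 64

/* Adds `delta` to every element in place. Each cache line is read and
   then written, which is the case where a write-intent prefetch applies. */
void add_in_place(uint32_t *data, size_t n, uint32_t delta)
{
    for (size_t i = 0; i < n; i++) {
        /* Roughly once per 64-byte line (16 uint32_t values), hint that a
           line further ahead will be modified. The second argument (1)
           means "prefetch for write"; the third (3) means high locality. */
        if ((i % 16) == 0 && i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 1, 3);

        data[i] += delta;
    }
}
```

Because every prefetched line is written shortly afterwards, this loop satisfies the condition that PREFETCHW be used only when a write to the same cache line follows.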

Write-Combining Usage

Use write-combining instructions instead of PREFETCHW in situations where all of the following conditions are true:

The code will overwrite one or more complete cache lines with new data.

The new data will not be used again soon.

Write-combining instructions include the SSE and SSE2 instructions MOVNTDQ, MOVNTI, MOVNTPS, and MOVNTPD. They also include the MMX instruction MOVNTQ.

Write-combining instructions can dramatically improve memory-write performance. They write data directly to memory through write-combining buffers, bypassing the cache. This is faster than PREFETCHW because data does not need to be initially read from memory to fill the cache lines, only to be completely overwritten shortly thereafter. The new data is simply written to memory, replacing the old data in memory, so no memory read is performed.
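A minimal sketch of the write-only case, assuming a compiler that provides the SSE2 intrinsics in `<emmintrin.h>`: `_mm_stream_si128` corresponds to MOVNTDQ, and `_mm_sfence` orders the streaming stores before later stores. The function name `stream_fill` and its alignment and size requirements are assumptions made for this illustration.

```c
#include <emmintrin.h>  /* SSE2: _mm_set1_epi32, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Fills `bytes` bytes at `dst` with a repeated 32-bit value using
   non-temporal (write-combining) stores.
   Assumptions: `dst` is 16-byte aligned and `bytes` is a multiple of 64,
   so every iteration overwrites one complete 64-byte cache line. */
void stream_fill(void *dst, size_t bytes, uint32_t value)
{
    __m128i v = _mm_set1_epi32((int)value);
    __m128i *p = (__m128i *)dst;
    size_t chunks = bytes / 16;          /* number of 16-byte stores */

    for (size_t i = 0; i < chunks; i += 4) {
        /* Four MOVNTDQ stores cover one full cache line; the old
           contents of the line are never read. */
        _mm_stream_si128(p + i + 0, v);
        _mm_stream_si128(p + i + 1, v);
        _mm_stream_si128(p + i + 2, v);
        _mm_stream_si128(p + i + 3, v);
    }

    /* Make the streaming stores globally visible before later stores. */
    _mm_sfence();
}
```

Writing whole cache lines with streaming stores only, and no ordinary stores to the same lines, keeps the sketch within the two conditions listed above.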

One application where write-combining is useful, often in conjunction with prefetch instructions, is in copying large blocks of memory.
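The block-copy case might be sketched as follows, again assuming SSE2 intrinsics: the source is prefetched with `_mm_prefetch(..., _MM_HINT_NTA)` (PREFETCHNTA) and the destination is written with `_mm_stream_si128` (MOVNTDQ). The name `stream_copy`, the distance `COPY_PREFETCH_BYTES`, and the alignment and size requirements are hypothetical choices for this example, not values taken from the guide.

```c
#include <emmintrin.h>  /* SSE2 loads and streaming stores, _mm_prefetch */
#include <stddef.h>

/* Hypothetical prefetch distance in bytes; it needs tuning per platform. */
#define COPY_PREFETCH_BYTES 256

/* Copies `bytes` bytes from `src` to `dst`, one 64-byte line per iteration.
   Assumptions: both pointers are 16-byte aligned, `bytes` is a multiple
   of 64, and the two buffers do not overlap. */
void stream_copy(void *dst, const void *src, size_t bytes)
{
    const char *s = (const char *)src;
    char *d = (char *)dst;

    for (size_t off = 0; off < bytes; off += 64) {
        /* Prefetch source data that will be read a few iterations later. */
        if (off + COPY_PREFETCH_BYTES < bytes)
            _mm_prefetch(s + off + COPY_PREFETCH_BYTES, _MM_HINT_NTA);

        /* Read one cache line from the source through the cache. */
        __m128i a = _mm_load_si128((const __m128i *)(s + off));
        __m128i b = _mm_load_si128((const __m128i *)(s + off + 16));
        __m128i c = _mm_load_si128((const __m128i *)(s + off + 32));
        __m128i e = _mm_load_si128((const __m128i *)(s + off + 48));

        /* Write the full destination line with non-temporal stores. */
        _mm_stream_si128((__m128i *)(d + off), a);
        _mm_stream_si128((__m128i *)(d + off + 16), b);
        _mm_stream_si128((__m128i *)(d + off + 32), c);
        _mm_stream_si128((__m128i *)(d + off + 48), e);
    }

    /* Order the streaming stores before any later stores. */
    _mm_sfence();
}
```

A plain `memcpy` remains the reasonable default for small or cache-resident blocks; a routine like this is only worth measuring for large copies whose destination will not be read again soon.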

Note: The write-combining instructions are not recommended or necessary for write-combined memory regions since the processor will automatically combine writes for those regions. Write-combine memory types are indicated through the MTRRs and the page-attribute table (PAT).

Note: For best performance, do not mix write-combining instructions on a cache line with non-write-combining store instructions.

Cache and Memory Optimizations

Chapter 5
