Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

5.7 Streaming-Store/Non-Temporal Instructions

Optimization

Use streaming-store instructions such as MOVNTPS and MOVNTQ when writing arrays or buffers that do not need to reside in cache. These instructions allow the processor to perform a write without first reading the data from memory or from other processors' caches. This saves the time needed to read the cache line and also avoids evicting data from the cache that may still be needed, which can be a significant performance advantage. Most compilers make these instructions available through inline assembly or intrinsics. Routines 5 and 6 in Section 5.13, “Appropriate Memory Copying Routines,” illustrate combining streaming-store instructions with the PREFETCHNTA instruction to optimize memory-copy routines.
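As a rough illustration of the intrinsics route, the following C sketch copies a buffer with MOVNTPS (through `_mm_stream_ps`) while prefetching the source with PREFETCHNTA (through `_mm_prefetch`). It is not Routine 5 or 6 from Section 5.13. The function name `copy_streaming`, the prefetch distance `PREFETCH_AHEAD`, and the alignment and size requirements are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_stream_ps, _mm_prefetch */
#include <emmintrin.h>  /* SSE2: _mm_mfence */

/* Hypothetical tuning value: how far ahead of the current read
   position to prefetch, measured in floats. Not taken from the guide. */
#define PREFETCH_AHEAD 128

/* Copy count floats from src to dst using streaming stores.
   Assumes both pointers are 16-byte aligned and count is a multiple of 4. */
static void copy_streaming(float *dst, const float *src, size_t count)
{
    assert(((uintptr_t)dst & 15u) == 0);
    assert(((uintptr_t)src & 15u) == 0);
    assert(count % 4 == 0);

    for (size_t i = 0; i < count; i += 4) {
        /* Once per 64 bytes of source, request a later line with PREFETCHNTA,
           staying inside the source array. */
        if (i % 16 == 0 && i + PREFETCH_AHEAD < count) {
            _mm_prefetch((const char *)(src + i + PREFETCH_AHEAD), _MM_HINT_NTA);
        }
        /* MOVNTPS: store 16 bytes without first reading the destination line. */
        _mm_stream_ps(dst + i, _mm_load_ps(src + i));
    }

    /* Fence after the last streaming store, as the guide recommends. */
    _mm_mfence();
}
```

A suitable prefetch distance depends on the processor and memory system, so the value above should be treated as a starting point for measurement rather than a recommendation.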

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Streaming-store instructions are also sometimes called write-combining instructions. To improve system performance, the AMD Athlon 64 and AMD Opteron processors aggressively combine multiple memory-write cycles of any data size that address locations within a 64-byte, cache-line-aligned write buffer if a streaming-store instruction is used. This combining is accomplished with write-combine buffers. The number of write-combine buffers is processor-implementation dependent. Refer to Appendix B for more detailed information on write-combining.
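One way to give the write-combine buffers a chance to combine stores is to write each 64-byte, cache-line-aligned block completely and in address order. The C sketch below does this with four consecutive 16-byte MOVNTPS stores per line. The name `fill_lines_streaming` and the requirement that the destination be 64-byte aligned and a whole number of lines long are assumptions made for this example. Whether and how the stores are combined depends on the processor implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_set1_ps, _mm_stream_ps */
#include <emmintrin.h>  /* SSE2: _mm_mfence */

#define LINE_BYTES      64
#define FLOATS_PER_LINE (LINE_BYTES / sizeof(float))  /* 16 floats per line */

/* Fill dst[0..count) with value, one full 64-byte line per iteration.
   Assumes dst is 64-byte aligned and count is a multiple of 16. */
static void fill_lines_streaming(float *dst, size_t count, float value)
{
    assert(((uintptr_t)dst % LINE_BYTES) == 0);
    assert(count % FLOATS_PER_LINE == 0);

    const __m128 v = _mm_set1_ps(value);

    for (size_t i = 0; i < count; i += FLOATS_PER_LINE) {
        /* Four 16-byte streaming stores that together cover exactly one
           aligned 64-byte line, issued back to back in address order. */
        _mm_stream_ps(dst + i + 0,  v);
        _mm_stream_ps(dst + i + 4,  v);
        _mm_stream_ps(dst + i + 8,  v);
        _mm_stream_ps(dst + i + 12, v);
    }

    /* Fence after the last streaming store, as the guide recommends. */
    _mm_mfence();
}
```

Interleaving streaming stores to many different lines in the same loop may work against this pattern, since the number of write-combine buffers is limited and implementation dependent.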

Follow the last streaming-store instruction in a block of code with the MFENCE instruction to ensure that all of the write-combine buffers are written to memory.
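A typical place where the fence matters is when another thread or agent is told that the data is ready. The C sketch below writes a payload with streaming stores, issues MFENCE through `_mm_mfence`, and only then sets a flag. The `struct message` layout, the `publish` function, and the use of a C11 atomic flag are all hypothetical choices for this illustration and are not taken from the guide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_set1_ps, _mm_stream_ps */
#include <emmintrin.h>  /* SSE2: _mm_mfence */

#define MSG_FLOATS 64

/* Hypothetical message layout: a 64-byte-aligned payload plus a ready flag. */
struct message {
    _Alignas(64) float payload[MSG_FLOATS];
    atomic_int ready;
};

/* Fill the payload with streaming stores, fence, then raise the flag. */
static void publish(struct message *m, float value)
{
    const __m128 v = _mm_set1_ps(value);

    for (size_t i = 0; i < MSG_FLOATS; i += 4) {
        _mm_stream_ps(&m->payload[i], v);  /* MOVNTPS into the payload */
    }

    /* MFENCE after the last streaming store, before anything that
       announces the data, so the payload writes are not left pending
       in write-combine buffers when the flag becomes visible. */
    _mm_mfence();

    atomic_store_explicit(&m->ready, 1, memory_order_release);
}
```

A consumer in this sketch would be expected to read `ready` first and touch `payload` only after observing a nonzero value.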

Streaming-store instructions are also discussed in “Write-Combining Usage” on page 106. Also see Appendix B, “Implementation of Write-Combining.” For more information on write-combining, see “Write-Combining” in the AMD64 Architecture Programmer's Manual Volume 2: System Programming (order# 24593).

Cache and Memory Optimizations

Chapter 5
