Appendix D AGP Considerations 349
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
movdqa [rax + 48], xmm3
/* Allocate and fill another WC buffer. */
movdqa [rax + 64], xmm4
movdqa [rax + 80], xmm5
movdqa [rax + 96], xmm6
/* The second WC buffer is forced after the next write. */
/* The linear ascending order between cache lines */
/* is maintained since buffer is sent when filled. */
movdqa [rax + 112], xmm7
SFENCE
/* The SFENCE forces the write-combining buffer */
/* out of the processor and to the graphics chip. */
/* Set up the next drawing commands in cached */
/* memory structure ShadowRegs_Structure. */
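The store ordering shown above can also be written with compiler intrinsics. The following is a minimal sketch in C and is not taken from this guide: the helper name wc_write_128 is hypothetical, and ordinary memory stands in for the write-combining-mapped graphics memory. It writes two 64-byte lines in ascending address order using aligned 16-byte stores (the intrinsic counterpart of MOVDQA) and then issues SFENCE. A compiler may reorder or merge stores to ordinary memory, so code that targets real write-combining memory would typically rely on volatile accesses or assembly to fix the order.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics; also provides _mm_sfence */
#endif

/* Hypothetical helper (an assumption of this sketch, not part of the guide). */
/* Writes 128 bytes, that is, two 64-byte write-combining buffers, to dst in  */
/* ascending 16-byte steps and then fences. Both pointers must be 16-byte     */
/* aligned.                                                                   */
static void wc_write_128(void *dst, const void *src)
{
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    int i;

    /* Offsets 0, 16, ..., 112: the first buffer fills at offset 48, */
    /* the second at offset 112.                                     */
    for (i = 0; i < 8; i++) {
        _mm_store_si128(d + i, _mm_load_si128(s + i));
    }

    /* SFENCE: make the preceding stores globally visible before later stores. */
    _mm_sfence();
#else
    /* Fallback for targets without SSE2: a plain copy, no ordering guarantee. */
    memcpy(dst, src, 128);
#endif
}
```

A caller would pass a 16-byte-aligned pointer into the mapped command or frame-buffer region as dst and a 16-byte-aligned staging buffer as src.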
D.3 Fast-Write Optimizations for Video-Memory Copies
When performing block copies of an image to the graphics accelerator’s local memory, you can
preserve the contents of the L1 and L2 caches and reduce cache-line-replacement traffic to system
memory by block prefetching the image data with the nontemporal PREFETCHNTA instruction.
This works well with images loaded into system memory through disk DMA, because
PREFETCHNTA keeps the data out of the L2 cache and mostly out of the L1 data cache.
This is illustrated in Listing 32.
Note: On the AMD Athlon™ 64 and AMD Opteron™ processors, PREFETCHNTA uses one way of
the two-way set-associative L1 data cache. One way of the L1 data cache is 32 Kbytes, so
limit the block prefetch size to less than or equal to 32 Kbytes.
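The block-prefetch pattern described above can be sketched in C with intrinsics. This is an illustrative sketch under stated assumptions, not the guide's own listing: the function name copy_image_nta, the 16-Kbyte block size (chosen to stay within the 32-Kbyte limit), and the use of streaming stores for the write pass are all choices made for this example, and ordinary memory stands in for the frame buffer. Each block is first touched with PREFETCHNTA, one prefetch per 64-byte cache line, and only then copied to the destination.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics; also provides _mm_prefetch, _mm_sfence */
#endif

enum {
    CACHE_LINE_BYTES = 64,
    BLOCK_BYTES      = 16 * 1024   /* assumed block size; keep it <= 32 Kbytes */
};

/* Hypothetical helper (an assumption of this sketch, not part of the guide). */
/* Copies len bytes from src to dst one block at a time. Both pointers must   */
/* be 16-byte aligned and len must be a multiple of 16.                       */
static void copy_image_nta(void *dst, const void *src, size_t len)
{
#if defined(__SSE2__)
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    while (len > 0) {
        size_t block = len < (size_t)BLOCK_BYTES ? len : (size_t)BLOCK_BYTES;
        size_t i;

        /* Pass 1: nontemporal block prefetch, one PREFETCHNTA per cache line. */
        for (i = 0; i < block; i += CACHE_LINE_BYTES) {
            _mm_prefetch((const char *)(s + i), _MM_HINT_NTA);
        }

        /* Pass 2: read the prefetched block and write it to the destination. */
        /* Streaming stores are this sketch's choice for the write pass.      */
        for (i = 0; i < block; i += 16) {
            __m128i v = _mm_load_si128((const __m128i *)(s + i));
            _mm_stream_si128((__m128i *)(d + i), v);
        }

        s += block;
        d += block;
        len -= block;
    }

    /* SFENCE: make the streamed data globally visible before later stores. */
    _mm_sfence();
#else
    /* Fallback for targets without SSE2: a plain copy with no cache hints. */
    memcpy(dst, src, len);
#endif
}
```

Separating the prefetch pass from the copy pass is what makes this a block prefetch: the whole block is requested before any of it is consumed, so the block size, rather than the total image size, bounds how much of the L1 data cache the copy occupies.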
Listing 32. Writing Nontemporal Data to Video RAM
/* Copy an image larger than 32 Kbytes into local memory, */
/* but limit the block prefetch so as not to exceed 32 Kbytes, */
/* which is the size of the nontemporal cache. */
/* First, block prefetch 16 Kbytes into the L1 data cache, then write */
/* it to the frame buffer. */
/* On AMD Athlon 64 and AMD Opteron processors, the PREFETCHNTA */
/* instruction must execute prior to subsequent instructions. */
/* Cache lines that are prefetched via PREFETCHNTA and later replaced are */
/* not evicted to the L2 cache or system memory. */