
better throughput since bus efficiency is increased. This is because bus arbitration overhead is lower: only one address/attribute phase is issued per burst in the PCI-X case, and only one address/command phase is issued per burst in the AGP Fast Writes case. An illustration of address-phase overhead on AGP Fast Writes is provided in Figure 10 on page 346 in Appendix D, AGP Considerations.

For the reasons cited in the preceding paragraph, to utilize hardware write chaining efficiently, software should flush the CPU write-combining buffers in sequential linear address order any time a target hardware device is capable of receiving large bursts of CPU write data.

Software should be aware that on AMD64 processors that have multiple write-combining buffers (that is, revision D and E processors), events that flush the write-combining buffers (see Appendix B, Table 8) send out the 64-byte WC buffers in the order in which the streams were opened. This means that if the CPU writes to WC space in a higher-addressed 64-byte buffer first (for example, address 40h) and then writes to a lower-addressed 64-byte buffer (for example, address 00h), then when those buffers are sent by the CPU (over HyperTransport to the tunnel), the higher-addressed 64-byte buffer is sent first, followed by the second (lower-addressed) 64-byte buffer. Because the addressing is not sequential, the tunnel device cannot "chain" the two 64-byte WC buffers and must issue two separate transactions on the target bus.
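As a minimal illustration of this behavior (the function and pointer names are hypothetical, and simple byte loops stand in for whatever store sequence a driver actually issues), the following C fragment writes the higher-addressed 64-byte block of a WC-mapped region before the lower-addressed one, producing the non-sequential flush order just described:

void wc_write_blocks_out_of_order(volatile unsigned char *wc_base,
                                  const unsigned char *src)
{
    int i;

    /* Open the higher-addressed 64-byte WC buffer first (offset 40h). */
    for (i = 0; i < 64; i++)
        wc_base[0x40 + i] = src[0x40 + i];

    /* Then open the lower-addressed buffer (offset 00h). On a flushing
       event, the buffers leave the CPU in the order the streams were
       opened (40h, then 00h), so the tunnel cannot chain them and must
       issue two separate transactions on the target bus. */
    for (i = 0; i < 64; i++)
        wc_base[0x00 + i] = src[0x00 + i];
}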

If the above example were targeted at AGP Fast Writes, issuing two Fast Write transactions (rather than one) would reduce the bandwidth (data throughput) by one third. See Figure 10 on page 346 in Appendix D.

Optimizations

Adhere to the following guidelines to ensure that Revision D and E AMD Athlon 64 and AMD Opteron processors issue WC buffers in sequential address order:

When practical, shadow the data structure in memory (rather than writing the actual WC buffers in MMI/O space) prior to copying the structure to WC MMI/O space. This ensures that the write-combining buffers are not emptied prematurely by external events (such as a UC read, perhaps issued by another device-driver thread or a hardware interrupt). Shadowing also ensures that writes to different cache lines in the structure do not send out the WC buffers, since the number of WC buffers that can be open at one time is CPU-implementation dependent.
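The following sketch shows the shadowing step, assuming a hypothetical 128-byte device command structure (the structure, function, and variable names are illustrative only); the point is simply that the structure is assembled with ordinary cacheable stores, which cannot open or flush WC buffers:

/* Hypothetical 128-byte structure whose final destination is WC MMI/O space. */
struct wc_command {
    unsigned int dword[32];        /* 32 x 4 bytes = 128 bytes */
};

static struct wc_command shadow;   /* shadow copy in cacheable (WB) memory */

void build_command(const unsigned int *values)
{
    int i;

    /* Assemble the structure in the shadow copy. These are ordinary
       cacheable stores, so no WC buffers are opened, and a UC read or
       a hardware interrupt occurring here cannot flush a partially
       filled WC buffer. */
    for (i = 0; i < 32; i++)
        shadow.dword[i] = values[i];
}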

When ready to update the actual WC MMI/O address space, copy the shadowed structure from memory to MMI/O, starting with the lowest-addressed 64-byte block and working upward. To do the copy, use discrete loads and stores for up to 64 bytes of data, a loop of discrete loads and stores for up to 4 Kbytes of data, and REP MOVS instructions for up to 32 Kbytes. The discrete loads and stores can be written in assembly language or, if available, with compiler intrinsic functions (__movsb(), __movsw(), __movsd(), and so on).
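Continuing the sketch above (wc_dst is a hypothetical pointer to the structure's WC MMI/O mapping), the shadowed structure is then copied with discrete stores from the lowest-addressed doubleword upward, so that each 64-byte WC buffer is opened, and therefore later sent, in sequential address order:

void commit_command(volatile unsigned int *wc_dst,
                    const struct wc_command *cmd)
{
    int i;

    /* Copy doubleword by doubleword from the lowest address upward.
       Each 64-byte block is filled completely before the next block is
       touched, so the WC buffers are opened and flushed in ascending
       address order, allowing the target bridge to chain them into a
       single burst. */
    for (i = 0; i < 32; i++)
        wc_dst[i] = cmd->dword[i];
}

For larger structures, the same ascending-order copy could be expressed with REP MOVS in assembly language or, where the compiler provides them, with the intrinsic functions mentioned above.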

In general, these methods exhibit less overhead in a data-movement function than calling a LIBC memcpy() function, which is usually optimized for copying larger blocks of memory.

