Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

[Block diagram; labels recovered from the figure: CPU 0 and CPU 1; Victim Buffer (8-entry); Write Buffer (4-entry); Instruction MAB (2-entry); Data MAB (8-entry); "All buffers are 64-bit"; command/address; System Request Queue (24-entry), to CPU; Memory Command Queue (20-entry), to DCT; Address MAP & GART; XBAR; HyperTransport 0, 1, and 2 Input; HyperTransport 0, 1, and 2 Output; five Routers with 10-, 16-, 16-, 16-, and 12-entry buffers.]
Figure 12. Northbridge Command Flow

D.5 Memory Optimizations for Graphics-Engine Programming Using the DMA Model

Historically (that is, with AGP 1.0 and AGP 2.0), AGP memory used for command DMA buffers was accessed by the processor through the AGP aperture space (this feature is referred to as host translation). This address space was mapped as write-combining because the processor’s caches were not snooped by an AGP master (that is, coherency was not enforced for AGP memory). Write-combining offered the best bandwidth in this situation because the processor could send its write-combining buffers to system memory as full buffers. However, system memory still needed to be written, which used memory bandwidth.
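To illustrate why full write-combining buffers matter, the following is a minimal sketch in C, not code from this guide. It assumes a 64-byte write-combining buffer and a hypothetical padding command (CMD_NOP); the function name emit_commands is likewise invented for illustration. It writes command dwords in ascending address order and pads the tail so that every 64-byte chunk is written completely. An ordinary array stands in for the write-combining mapping here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WC_BUFFER_BYTES 64u                    /* assumed write-combining buffer size */
#define DWORDS_PER_WC   (WC_BUFFER_BYTES / 4u) /* 16 dwords per 64-byte chunk */
#define CMD_NOP         0u                     /* hypothetical padding command */

/* Copy n command dwords to dst in ascending address order, then pad with
 * CMD_NOP up to the next 64-byte boundary. Returns the number of dwords
 * written. dst must be 64-byte aligned and large enough to hold n rounded
 * up to a multiple of 16 dwords. */
static size_t emit_commands(volatile uint32_t *dst, const uint32_t *cmds, size_t n)
{
    size_t i;

    assert(((uintptr_t)dst % WC_BUFFER_BYTES) == 0);

    for (i = 0; i < n; i++)
        dst[i] = cmds[i];        /* sequential stores, no gaps */

    while (i % DWORDS_PER_WC != 0)
        dst[i++] = CMD_NOP;      /* complete the last 64-byte chunk */

    return i;
}
```

Whether padding is appropriate depends on the graphics engine's command format; the point of the sketch is only that sequential, gap-free stores let each buffer fill before it is flushed.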

On current systems, however, coherency between an AGP master (making accesses through the AGP aperture) and the processor caches is maintained by the HyperTransport protocol and the MOESI (modified, owner, exclusive, shared, invalid) caching policy. Coherency support between an AGP master and the processor caches is enabled through a bit in the GART entry (Gart_entry.coh). The AGP miniport driver sets this bit as it maps entries in the GART. The video graphics miniport driver can verify this feature in the AGP 3.0-compliant register (AGPSTAT.ita_entry.coh), which is found in the AGP bridge device.

Note: Coherency support is implemented by hardware in AMD Athlon 64 and AMD Opteron processors, and is not specific to the AGP tunnel device, even though the support is indicated in the tunnel’s AGP 3.0-compliant register (AGPSTAT.ita_entry.coh).
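As a rough illustration of a driver setting the coherency bit while mapping a GART entry, here is a minimal sketch in C. The bit positions, the 32-bit entry width, and the helper names are assumptions made for this example and are not taken from this guide; consult the processor's BIOS and kernel developer documentation for the actual GART entry format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit layout, assumed for illustration only. */
#define GART_ENTRY_VALID    (1u << 0)  /* assumed: entry is present */
#define GART_ENTRY_COHERENT (1u << 1)  /* assumed: stands in for Gart_entry.coh */

/* Build a 32-bit GART entry for a 4-KB-aligned physical page below 4 GB.
 * When coherent is nonzero, the (assumed) coherency bit is set. */
static uint32_t gart_make_entry(uint32_t phys_page, int coherent)
{
    uint32_t entry;

    assert((phys_page & 0xFFFu) == 0);  /* page must be 4-KB aligned */

    entry = phys_page | GART_ENTRY_VALID;
    if (coherent)
        entry |= GART_ENTRY_COHERENT;
    return entry;
}

/* Report whether the (assumed) coherency bit is set in an entry. */
static int gart_entry_is_coherent(uint32_t entry)
{
    return (entry & GART_ENTRY_COHERENT) != 0;
}
```

For example, under these assumptions gart_make_entry(0x12345000u, 1) yields 0x12345003, and gart_entry_is_coherent reports nonzero for that value.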

Therefore, a key optimization for the DMA model on AMD Athlon 64 and AMD Opteron processors is that the AGP master can read data from the processor caches faster than it can read data from DDR memory, because the processor caches operate at higher clock frequencies. As processor clock

AGP Considerations

Appendix D
