Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

[Block diagram. Labeled blocks: Level 2 Cache, L2 ECC, L2 Tags, L2 Tag ECC; System Request Queue (SRQ); Cross Bar (XBAR); Memory Controller and HyperTransport™ Technology; Level 1 Instruction Cache, Instruction TLB; Fetch 2 Transit; Pick; Decode 1, Decode 2, Pack, and Decode (three of each); 8-Entry Scheduler (three), each with an AGU and an ALU; 36-Entry Scheduler; FADD, FMUL, FMISC; Branch Selectors (4K), 16K 2-bit Global History Counters, Target Array (2K Targets), and Return Address Stack (12 Entries); Level 1 Data Cache, ECC, Data TLB.]

Figure 6. AMD Athlon™ 64 and AMD Opteron™ Processors Block Diagram

A.5 L1 Instruction Cache

The out-of-order execution engine of the AMD Athlon 64 and AMD Opteron processors contains a very large L1 instruction cache (64 Kbytes, two-way set-associative). Each line in this cache is 64 bytes long. The functions associated with the L1 instruction cache are instruction loads, instruction prefetching, instruction predecoding, and branch prediction. Requests that miss in the L1 instruction cache are fetched from the L2 cache or, if they also miss there, from local memory through the integrated memory controller.

The L1 instruction cache generates fetches on the naturally aligned 64 bytes containing the instructions and on the next sequential line of 64 bytes (a prefetch). The principle of program-spatial locality makes code prefetching very effective: it avoids or reduces the execution stalls caused by the time required to read the necessary code. Cache-line replacement is based on a least-recently-used replacement algorithm.
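The address arithmetic implied by "naturally aligned 64 bytes" and "next sequential line" can be sketched in C as follows. This is an illustration only, not a description of the hardware logic: the helper names line_base, next_line_base, and print_fetch_pair are hypothetical, and the only figure taken from the text is the 64-byte line size.

```c
#include <stdint.h>
#include <stdio.h>

/* Line size stated in the text: each L1 instruction-cache line is 64 bytes. */
#define ICACHE_LINE_BYTES 64u

/* Hypothetical helper: start of the naturally aligned 64-byte line that
   contains addr. Clearing the low six address bits rounds down to a
   multiple of 64. */
static uint64_t line_base(uint64_t addr)
{
    return addr & ~(uint64_t)(ICACHE_LINE_BYTES - 1);
}

/* Hypothetical helper: start of the next sequential 64-byte line, which
   the text describes as being fetched along with the demand line
   (a prefetch). */
static uint64_t next_line_base(uint64_t addr)
{
    return line_base(addr) + ICACHE_LINE_BYTES;
}

/* Hypothetical helper: print the pair of line addresses associated with
   an instruction address. */
static void print_fetch_pair(uint64_t addr)
{
    printf("instruction at 0x%llx: demand line 0x%llx, prefetch line 0x%llx\n",
           (unsigned long long)addr,
           (unsigned long long)line_base(addr),
           (unsigned long long)next_line_base(addr));
}
```

For example, calling print_fetch_pair(0x1234) reports the demand line at 0x1200 and the prefetch line at 0x1240. One practical reading of this arithmetic is that a hot loop aligned to a 64-byte boundary starts at the beginning of a demand line, so more of its instructions arrive with the first fetch.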


Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors Appendix A
