A Detailed Look Inside the Intel® NetBurst Micro-Architecture of the Intel Pentium® 4 Processor
Prefetching
The Intel NetBurst micro-architecture supports three prefetching mechanisms:
§ the first is for instructions only
§ the second is for data only
§ the third is for code or data.
The first mechanism is a hardware instruction fetcher that automatically prefetches instructions. The second is a
software-controlled mechanism that fetches data into the caches using the prefetch instructions. The third is a
hardware mechanism that automatically fetches data and instructions into the unified second-level cache.
The hardware instruction fetcher reads instructions along the path predicted by the BTB into the instruction
streaming buffers. Data is read in 32-byte chunks starting at the target address. The second and third mechanisms are
described in Data Prefetch.
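To make the software-controlled mechanism concrete, the C sketch below issues SSE prefetch hints ahead of a simple streaming read. It is a minimal illustration rather than a tuned kernel: the function name, the prefetch distance of 64 elements, and the _MM_HINT_T0 hint are illustrative assumptions; suitable values depend on the access pattern and the cache hierarchy.

#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch and the _MM_HINT_* hints (SSE) */

/* Sum a large array while hinting the processor to start loading data
 * before it is needed.  One hint is issued per 16 floats (64 bytes,
 * roughly one cache line), 64 elements ahead of the current position. */
float sum_with_prefetch(const float *data, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if ((i % 16) == 0 && i + 64 < n)
            _mm_prefetch((const char *)&data[i + 64], _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}

Prefetching too close to the use hides little of the memory latency, while prefetching too far ahead risks evicting the data before it is consumed, so the distance is a tuning parameter.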
Decoder
The front end of the Intel NetBurst micro-architecture has a single decoder that can decode instructions at the
maximum rate of one instruction per clock. Complex instructions must enlist the help of the microcode ROM. The
decoder operation is connected to the execution trace cache discussed in the section that follows.
Execution Trace Cache
The execution trace cache (TC) is the primary instruction cache in the Intel NetBurst micro-architecture. The TC
stores decoded IA-32 instructions, or µops. This removes decoding costs on frequently executed code, such as the
decoder template restrictions and the extra latency of re-decoding instructions after a branch misprediction.
In the Pentium 4 processor implementation, the TC can hold up to 12K µops and can deliver up to three µops per
cycle. The TC does not hold all of the µops that need to be executed in the execution core. In some situations, the
execution core may need to execute a microcode flow, instead of the µop traces that are stored in the trace cache.
The Pentium 4 processor is optimized so that most frequently-executed IA-32 instructions come from the trace
cache, efficiently and continuously, while only a few instructions involve the microcode ROM.
Branch Prediction
Branch prediction is very important to the performance of a deeply pipelined processor. Branch prediction enables
the processor to begin executing instructions long before the branch outcome is certain. Branch delay is the penalty
that is incurred in the absence of a correct prediction. For the Pentium 4 processor, the branch delay for a correctly
predicted instruction can be as few as zero clock cycles. The branch delay for a mispredicted branch can be many
cycles; typically this is equivalent to the depth of the pipeline.
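A small example makes this cost concrete. In the hedged sketch below (hypothetical function names, C used only for illustration), counting negative values with an if statement on random input gives the predictor roughly a coin-flip outcome, so many iterations pay a misprediction penalty on the order of the pipeline depth; rewriting the count arithmetically removes the data-dependent branch entirely. Whether a given compiler actually emits a branch or a branchless sequence depends on the compiler and its options.

#include <stddef.h>

/* Data-dependent branch: on random input the predictor guesses wrong
 * about half the time, so many iterations pay the full misprediction
 * penalty. */
long count_negative_branchy(const int *v, size_t n)
{
    long count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] < 0)
            count++;
    }
    return count;
}

/* Branchless form: the comparison result is consumed as a 0-or-1
 * value, so there is no conditional branch on the data to mispredict. */
long count_negative_branchless(const int *v, size_t n)
{
    long count = 0;
    for (size_t i = 0; i < n; i++)
        count += (v[i] < 0);
    return count;
}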
The branch prediction in the Intel NetBurst micro-architecture predicts all near branches, including conditional
branches, unconditional calls and returns, and indirect branches. It does not predict far transfers, for example, far calls, irets,
and software interrupts.
In addition, several mechanisms are implemented to aid in predicting branches more accurately and in reducing the
cost of taken branches:
§ dynamically predict the direction and target of branches based on the instructions’ linear address using the
branch target buffer (BTB)
§ if no dynamic prediction is available or if it is invalid, statically predict the outcome based on the offset of the
target: a backward branch is predicted to be taken, a forward branch is predicted to be not taken (see the code-layout
sketch after this list)
§ return addresses are predicted using the 16-entry return address stack
§ traces of instructions are built across predicted taken branches to avoid branch penalties.
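For the static-prediction rule above, arranging code so that rare cases sit on forward branches and loop back-edges remain backward branches lets the static rules guess correctly most of the time. The sketch below is a minimal illustration with hypothetical names; the final layout is ultimately up to the compiler.

#include <stddef.h>

/* The loop's closing branch is a backward branch, which the static
 * rule predicts as taken -- correct on every iteration but the last.
 * The error check is a forward branch to an out-of-line handler,
 * which the static rule predicts as not taken -- correct as long as
 * bad items are rare. */
int process_items(const int *items, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (items[i] < 0)      /* rare case: forward branch, predicted not taken */
            goto bad_item;
        /* common case falls through; the branch back to the top of the
         * loop is backward and predicted taken */
    }
    return 0;

bad_item:
    return -1;                 /* out-of-line error path */
}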