A Detailed Look Inside the Intel® NetBurst Micro-Architecture of the Intel Pentium® 4 Processor
Prefetching
The Intel NetBurst micro-architecture supports three prefetching mechanisms:
§ the first is for instructions only
§ the second is for data only
§ the third is for code or data.
The first mechanism is a hardware instruction fetcher that automatically prefetches instructions. The second is a
software-controlled mechanism that fetches data into the caches using the prefetch instructions. The third is a
hardware mechanism that automatically fetches data and instructions into the unified second-level cache.
The hardware instruction fetcher reads instructions along the path predicted by the BTB into the instruction
streaming buffers. Data is read in 32-byte chunks starting at the target address. The second and third mechanisms are
described in Data Prefetch.
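To make the software-controlled mechanism concrete, the C sketch below issues SSE prefetch hints ahead of a simple streaming read. It is a minimal illustration rather than a tuned kernel: the function name, the prefetch distance of 64 elements, and the _MM_HINT_T0 hint are illustrative assumptions; suitable values depend on the access pattern and the cache hierarchy.

#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch and the _MM_HINT_* hints (SSE) */

/* Sum a large array while hinting the processor to start loading data
 * before it is needed.  One hint is issued per 16 floats (64 bytes,
 * roughly one cache line), 64 elements ahead of the current position. */
float sum_with_prefetch(const float *data, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if ((i % 16) == 0 && i + 64 < n)
            _mm_prefetch((const char *)&data[i + 64], _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}

Prefetching too close to the use hides little of the memory latency, while prefetching too far ahead risks evicting the data before it is consumed, so the distance is a tuning parameter.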
Decoder
The front end of the Intel NetBurst micro-architecture has a single decoder that can decode instructions at the
maximum rate of one instruction per clock. Complex instructions must enlist the help of the microcode ROM. The
decoder operation is connected to the execution trace cache discussed in the section that follows.
Execution Trace Cache
The execution trace cache (TC) is the primary instruction cache in the Intel NetBurst micro-architecture. The TC
stores decoded IA-32 instructions, or µops. This removes decoding costs on frequently executed code, such as the
decoder template restrictions and the extra latency of re-decoding instructions after a branch misprediction.
In the Pentium 4 processor implementation, the TC can hold up to 12K µops and can deliver up to three µops per
cycle. The TC does not hold all of the µops that need to be executed in the execution core. In some situations, the
execution core may need to execute a microcode flow, instead of the µop traces that are stored in the trace cache.
The Pentium 4 processor is optimized so that most frequently-executed IA-32 instructions come from the trace
cache, efficiently and continuously, while only a few instructions involve the microcode ROM.
Branch Prediction
Branch prediction is very important to the performance of a deeply pipelined processor. Branch prediction enables
the processor to begin executing instructions long before the branch outcome is certain. Branch delay is the penalty
that is incurred in the absence of a correct prediction. For the Pentium 4 processor, the branch delay for a correctly
predicted instruction can be as few as zero clock cycles. The branch delay for a mispredicted branch can be many
cycles; typically this is equivalent to the depth of the pipeline.
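A small example makes this cost concrete. In the hedged sketch below (hypothetical function names, C used only for illustration), counting negative values with an if statement on random input gives the predictor roughly a coin-flip outcome, so many iterations pay a misprediction penalty on the order of the pipeline depth; rewriting the count arithmetically removes the data-dependent branch entirely. Whether a given compiler actually emits a branch or a branchless sequence depends on the compiler and its options.

#include <stddef.h>

/* Data-dependent branch: on random input the predictor guesses wrong
 * about half the time, so many iterations pay the full misprediction
 * penalty. */
long count_negative_branchy(const int *v, size_t n)
{
    long count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] < 0)
            count++;
    }
    return count;
}

/* Branchless form: the comparison result is consumed as a 0-or-1
 * value, so there is no conditional branch on the data to mispredict. */
long count_negative_branchless(const int *v, size_t n)
{
    long count = 0;
    for (size_t i = 0; i < n; i++)
        count += (v[i] < 0);
    return count;
}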
The branch prediction in the Intel NetBurst micro-architecture predicts all near branches, including conditional
branches, unconditional calls and returns, and indirect branches. It does not predict far transfers, for example, far calls, irets,
and software interrupts.
In addition, several mechanisms are implemented to aid in predicting branches more accurately and in reducing the
cost of taken branches:
§ dynamically predict the direction and target of branches based on the instructions’ linear address using the
branch target buffer (BTB)
§ if no dynamic prediction is available or if it is invalid, statically predict the outcome based on the offset of the
target: a backward branch is predicted to be taken, a forward branch is predicted to be not taken (see the code-layout
sketch after this list)
§ return addresses are predicted using the 16-entry return address stack
§ traces of instructions are built across predicted taken branches to avoid branch penalties.
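For the static-prediction rule above, arranging code so that rare cases sit on forward branches and loop back-edges remain backward branches lets the static rules guess correctly most of the time. The sketch below is a minimal illustration with hypothetical names; the final layout is ultimately up to the compiler.

#include <stddef.h>

/* The loop's closing branch is a backward branch, which the static
 * rule predicts as taken -- correct on every iteration but the last.
 * The error check is a forward branch to an out-of-line handler,
 * which the static rule predicts as not taken -- correct as long as
 * bad items are rare. */
int process_items(const int *items, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (items[i] < 0)      /* rare case: forward branch, predicted not taken */
            goto bad_item;
        /* common case falls through; the branch back to the top of the
         * loop is backward and predicted taken */
    }
    return 0;

bad_item:
    return -1;                 /* out-of-line error path */
}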