A Detailed Look Inside the Intel® NetBurst Micro-Architecture of the Intel Pentium® 4 Processor
µops called traces, which are stored in the execution trace cache. The execution trace cache stores these µops in the
path of program execution flow, where the results of branches in the code are integrated into the same cache line.
This increases the instruction flow from the cache and makes better use of the overall cache storage space since the
cache no longer stores instructions that are branched over and never executed. The execution trace cache can deliver
up to 3 µops per clock to the core.
The execution trace cache and the translation engine have cooperating branch prediction hardware. Branch targets
are predicted based on their linear address using branch prediction logic and fetched as soon as possible. Branch
targets are fetched from the execution trace cache if they are cached there; otherwise, they are fetched from the
memory hierarchy. The translation engine’s branch prediction information is used to form traces along the most
likely paths.
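The trace-building idea described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not a description of the actual hardware: the program encoding, the `build_trace` function, the `predict_taken` callback, and the `max_uops` limit are all hypothetical names and values chosen for illustration.

```python
def build_trace(program, start, predict_taken, max_uops=6):
    """Collect µops along the predicted execution path, starting at `start`.

    program: dict mapping address -> (µop name, branch target or None).
    predict_taken: hypothetical predictor callback, address -> bool.
    max_uops: assumed trace length limit (illustrative value only).
    """
    trace = []
    pc = start
    while pc in program and len(trace) < max_uops:
        uop, branch_target = program[pc]
        trace.append(uop)
        if branch_target is not None and predict_taken(pc):
            pc = branch_target  # continue the trace at the predicted target
        else:
            pc += 1             # fall through to the next address
    return trace

# Hypothetical program: the branch at address 2 jumps over addresses 3 and 4.
program = {
    0: ("load", None),
    1: ("cmp", None),
    2: ("jcc", 5),
    3: ("add", None),
    4: ("sub", None),
    5: ("store", None),
    6: ("ret", None),
}

trace = build_trace(program, 0, predict_taken=lambda pc: True)
print(trace)  # ['load', 'cmp', 'jcc', 'store', 'ret']
```

In this toy model the branched-over µops ("add" and "sub") never enter the trace, which mirrors the point that the cache does not spend storage on instructions that are skipped.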
The Out-of-Order Core
The core’s ability to execute instructions out of order is a key factor in enabling parallelism. This feature enables the
processor to reorder instructions so that if one µop is delayed while waiting for data or a contended resource, other
µops that appear later in the program order may proceed around it. The processor employs several buffers to smooth
the flow of µops. This implies that when one portion of the entire processor pipeline experiences a delay, that delay
may be covered by other operations executing in parallel (for example, in the core) or by the execution of µops
which were previously queued up in a buffer (for example, in the front end).
The delays described in this paper must be understood in this context. The core is designed to facilitate parallel
execution. It can dispatch up to six µops per cycle through the issue ports. (The issue ports are shown in Figure 4.)
Note that six µops per cycle exceeds the trace cache and retirement µop bandwidth. The higher bandwidth in the
core allows for peak bursts of more than three µops per cycle and achieves higher issue rates by allowing greater
flexibility in issuing µops to different execution ports.
Most execution units can start executing a new µop every cycle, so that several instructions can be in flight at a time
for each pipeline. A number of arithmetic logical unit (ALU) instructions can start two per cycle, and many floating-
point instructions can start one every two cycles. Finally, µops can begin execution, out of order, as soon as their
data inputs are ready and resources are available.
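The out-of-order behavior described above can be sketched with a toy dataflow scheduler. This is a simplified illustration, not a model of the actual core: the `schedule` function, the µop names, and the latencies are hypothetical, execution ports and other resource limits are ignored, and every latency is assumed to be at least one cycle.

```python
def schedule(uops, issue_width=6):
    """Return {name: start_cycle} for a toy out-of-order scheduler.

    uops: list of (name, latency, [producer names]) in program order.
    A µop may start once all of its producers have finished, regardless
    of program order; at most `issue_width` µops start in one cycle.
    Assumes every producer named in a dependency list is in `uops`.
    """
    done_at = {}   # name -> cycle at which the result becomes available
    start = {}
    pending = list(uops)
    cycle = 0
    while pending:
        issued = 0
        still_pending = []
        for name, latency, deps in pending:
            ready = all(d in done_at and done_at[d] <= cycle for d in deps)
            if ready and issued < issue_width:
                start[name] = cycle
                done_at[name] = cycle + latency
                issued += 1
            else:
                still_pending.append((name, latency, deps))
        pending = still_pending
        cycle += 1
    return start

# Hypothetical µops: "add" waits on a slow load; "mul" and "sub" do not.
uops = [
    ("load", 4, []),
    ("add", 1, ["load"]),
    ("mul", 1, []),
    ("sub", 1, ["mul"]),
]

start = schedule(uops)
print(start)  # {'load': 0, 'mul': 0, 'sub': 1, 'add': 4}
```

Here "mul" and "sub" appear after "add" in program order but start earlier, because "add" is delayed waiting for the load result.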
Retirement
The retirement section receives the results of the executed µops from the execution core and processes the results so
that the proper architectural state is updated according to the original program order. For semantically correct
execution, the results of an IA-32 instruction must be committed in original program order before it is retired.
Exceptions may be raised as instructions are retired. Thus, exceptions cannot occur speculatively; they occur in the
correct order, and the machine can be correctly restarted after an exception.
When a µop completes and writes its result to the destination, it is retired. Up to three µops may be retired per cycle.
The Reorder Buffer (ROB) is the unit in the processor which buffers completed µops, updates the architectural state
in order, and manages the ordering of exceptions.
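In-order retirement from a reorder buffer can be sketched as follows. This is a minimal illustration under stated assumptions, not the processor's actual mechanism: the `retire` function and the µop names are hypothetical, and only the limit of three µops per cycle comes from the text.

```python
from collections import deque

def retire(rob, completed, width=3):
    """Retire up to `width` µops from the head of the ROB, oldest first.

    Retirement stops at the first µop that has not completed, so younger
    completed µops wait behind it and state is updated in program order.
    """
    retired = []
    while rob and len(retired) < width and rob[0] in completed:
        retired.append(rob.popleft())
    return retired

# Hypothetical ROB contents in program order; u1 is still executing.
rob = deque(["u0", "u1", "u2", "u3", "u4"])
completed = {"u0", "u2", "u3"}

first = retire(rob, completed)    # only u0: u1 blocks u2 and u3
completed.add("u1")
second = retire(rob, completed)   # u1, u2, u3: limited to three per cycle
completed.add("u4")
third = retire(rob, completed)    # u4

print(first, second, third)  # ['u0'] ['u1', 'u2', 'u3'] ['u4']
```

Although "u2" and "u3" finished executing before "u1", they are not retired until "u1" completes, which is what keeps the architectural state and exception ordering consistent with program order.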
The retirement section also keeps track of branches and sends updated branch target information to the Branch
Target Buffer (BTB) to update branch history. Figure 3 illustrates the paths that are most frequently executed
inside the Intel NetBurst micro-architecture: an execution loop that interacts with the multi-level cache hierarchy and
the system bus.
The following sections describe in more detail the operation of the front end and the execution core.
Front End Pipeline Detail
The following information about the front end operation may be useful for tuning software with respect to
prefetching, branch prediction, and execution trace cache operations.