Intel NetBurst Front End Pipeline Detail

A Detailed Look Inside the Intel® NetBurst™ Micro-Architecture of the Intel Pentium® 4 Processor

µops called traces, which are stored in the execution trace cache. The execution trace cache stores these µops in the

path of program execution flow, where the results of branches in the code are integrated into the same cache line.

This increases the instruction flow from the cache and makes better use of the overall cache storage space since the

cache no longer stores instructions that are branched over and never executed. The execution trace cache can deliver

up to 3 µops per clock to the core.
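
As a rough illustration of why a trace omits branched-over code, the sketch below follows a toy program along its predicted path and records only the µops on that path. It is a simplified model, not Intel's implementation: the `build_trace` function, the program encoding, and the `max_uops` limit are all hypothetical choices made for this example.

```python
# Toy model only: build a trace by following the predicted path of execution.
# The names, the program encoding, and max_uops are illustrative assumptions.
def build_trace(program, start, taken, max_uops=6):
    """program: dict mapping address -> (uop_name, branch_target or None).
    taken: set of branch addresses that are predicted taken."""
    trace, pc = [], start
    while pc in program and len(trace) < max_uops:
        name, target = program[pc]
        trace.append(name)
        # Follow the branch target if the branch is predicted taken,
        # otherwise fall through to the next address.
        pc = target if (target is not None and pc in taken) else pc + 1
    return trace

program = {
    0: ("cmp", None),
    1: ("jne", 4),      # branch to address 4
    2: ("add", None),   # branched over when the jne is taken
    3: ("sub", None),   # branched over when the jne is taken
    4: ("mov", None),
    5: ("ret", None),
}
trace = build_trace(program, start=0, taken={1})
print(trace)
```

With the branch at address 1 predicted taken, the µops at addresses 2 and 3 never enter the trace, so they occupy no space in this toy cache line.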

The execution trace cache and the translation engine have cooperating branch prediction hardware. Branch targets

are predicted based on their linear address using branch prediction logic and fetched as soon as possible. Branch

targets are fetched from the execution trace cache if they are cached there; otherwise they are fetched from the

memory hierarchy. The translation engine’s branch prediction information is used to form traces along the most

likely paths.

The Out-of-Order Core

The core’s ability to execute instructions out of order is a key factor in enabling parallelism. This feature enables the

processor to reorder instructions so that if one µop is delayed while waiting for data or a contended resource, other

µops that appear later in the program order may proceed around it. The processor employs several buffers to smooth

the flow of µops. This implies that when one portion of the entire processor pipeline experiences a delay, that delay

may be covered by other operations executing in parallel (for example, in the core) or by the execution of µops

which were previously queued up in a buffer (for example, in the front end).

The delays described in this paper must be understood in this context. The core is designed to facilitate parallel

execution. It can dispatch up to six µops per cycle through the issue ports. (The issue ports are shown in Figure 4.)

Note that six µops per cycle exceeds the trace cache and retirement µop bandwidth. The higher bandwidth in the

core allows for peak bursts of greater than 3 µops and helps achieve higher issue rates by allowing greater flexibility in

issuing µops to different execution ports.
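
The relationship between the 3 µop supply rate and the 6 µop dispatch rate can be sketched with a small queue model. This is only an illustration under simplifying assumptions: the `simulate` function is hypothetical, the queue is unbounded, and port and dependence constraints are ignored. It shows that after a stall the core can briefly dispatch more than 3 µops per cycle, while the long-run average stays bounded by the supply rate.

```python
# Toy model only: a front end supplying 3 µops per cycle feeds a queue that
# the core drains at up to 6 µops per cycle. Ports and dependences are ignored.
def simulate(stall_cycles, total_cycles, supply=3, issue_width=6):
    """Return the number of µops dispatched in each cycle."""
    queue, issued = 0, []
    for cycle in range(total_cycles):
        queue += supply                  # front end keeps delivering µops
        if cycle < stall_cycles:
            issued.append(0)             # core is stalled; µops accumulate
            continue
        n = min(queue, issue_width)      # dispatch as many as the core allows
        queue -= n
        issued.append(n)
    return issued

issued = simulate(stall_cycles=2, total_cycles=6)
print(issued)
```

In this run the core dispatches nothing for two cycles, then drains the backlog at 6 µops per cycle before settling back to 3 per cycle.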

Most execution units can start executing a new µop every cycle, so that several instructions can be in flight at a time

for each pipeline. A number of arithmetic logic unit (ALU) instructions can start two per cycle, and many floating-point

instructions can start one every two cycles. Finally, µops can begin execution, out of order, as soon as their

data inputs are ready and resources are available.
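
The rule that a µop may begin as soon as its inputs are ready can be sketched as a simple dataflow calculation. The example below is a toy model, not a description of the actual scheduler: the `schedule` function is hypothetical, the latencies are made up, and execution resources are assumed to be unlimited so that only data dependences matter.

```python
# Toy model only: each µop starts when all of its source operands are ready.
# Latencies are invented and execution resources are assumed unlimited.
def schedule(uops):
    """uops: list of (name, sources, latency) in program order.
    Returns a dict mapping each µop name to the cycle in which it starts."""
    ready_at = {}   # name -> cycle at which that µop's result is available
    start = {}
    for name, sources, latency in uops:
        begin = max((ready_at[s] for s in sources), default=0)
        start[name] = begin
        ready_at[name] = begin + latency
    return start

uops = [
    ("load", [], 10),         # slow µop, e.g. waiting for data
    ("use",  ["load"], 1),    # depends on the load, so it must wait
    ("add1", [], 1),          # independent of the load
    ("add2", ["add1"], 1),    # depends only on add1
]
start = schedule(uops)
print(start)
```

Here `add1` and `add2` appear after `use` in program order but start earlier, because they do not depend on the delayed load.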

Retirement

The retirement section receives the results of the executed µops from the execution core and processes the results so

that the proper architectural state is updated according to the original program order. For semantically correct

execution, the results of IA-32 instructions must be committed in original program order before they are retired.

Exceptions may be raised as instructions are retired. Thus, exceptions cannot occur speculatively; they occur in the

correct order, and the machine can be correctly restarted after an exception.

When a µop completes and writes its result to the destination, it is retired. Up to three µops may be retired per cycle.

The Reorder Buffer (ROB) is the unit in the processor which buffers completed µops, updates the architectural state

in order, and manages the ordering of exceptions.
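
In-order retirement with a limit of three µops per cycle can be sketched as follows. This is a minimal model under stated assumptions: the `retire` function is hypothetical, the buffer is a plain queue in program order, and a µop is allowed to retire in the same cycle in which it completes.

```python
from collections import deque

# Toy model only: µops complete out of order but retire strictly in program
# order, at most `width` per cycle. Same-cycle retirement is an assumption.
def retire(completion_cycle, width=3):
    """completion_cycle[i] is the cycle in which µop i (program order) completes.
    Returns the cycle in which each µop retires."""
    rob = deque(enumerate(completion_cycle))   # oldest µop at the left
    retired = [None] * len(completion_cycle)
    cycle = 0
    while rob:
        n = 0
        # Only the oldest µop may retire, and only once it has completed.
        while rob and n < width and rob[0][1] <= cycle:
            idx, _ = rob.popleft()
            retired[idx] = cycle
            n += 1
        cycle += 1
    return retired

retired = retire([5, 1, 1, 1, 1])
print(retired)
```

The four younger µops finish in cycle 1 but cannot retire until the oldest µop completes in cycle 5; three µops retire in that cycle and the remaining two retire in the next.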

The retirement section also keeps track of branches and sends updated branch target information to the Branch

Target Buffer (BTB) to update branch history. Figure 3 illustrates the paths that are most frequently executed

inside the Intel NetBurst micro-architecture: an execution loop that interacts with the multi-level cache hierarchy and

the system bus.

The following sections describe in more detail the operation of the front end and the execution core.

Front End Pipeline Detail

The following information about the front end operation may be useful for tuning software with respect to

prefetching, branch prediction, and execution trace cache operations.