A Detailed Look Inside the Intel® NetBurst™ Micro-Architecture of the Intel Pentium® 4 Processor

b) avoiding the need to access off-chip caches, which can increase the realized bandwidth compared to a

normal load-miss, which returns data to all cache levels.

The situations that are less likely to benefit from software-controlled data prefetch are the following:

§ In cases that are already bandwidth bound, prefetching tends to increase bandwidth demands and is thus not effective.

§ Prefetching too far ahead may cause data to be evicted from the caches before it is actually used in execution; not prefetching far enough ahead can reduce the ability to overlap memory and execution latencies.

§ When the prefetch can only be usefully placed in locations where the likelihood of that prefetch being used is low. Prefetches consume resources in the processor, and the use of too many prefetches can limit their effectiveness. Examples include prefetching data in a loop for a reference outside the loop, and prefetching in a basic block that is frequently executed but seldom precedes the reference that the prefetch targets.
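
As a rough illustration of the prefetch-distance trade-off listed above, the following C sketch sums an array while issuing roughly one software prefetch per cache line. It rests on several assumptions that do not come from this paper: `__builtin_prefetch` is a GCC/Clang compiler builtin, the names `sum_with_prefetch`, `PREFETCH_DISTANCE`, and `ELEMS_PER_LINE` are hypothetical, and the numeric values are placeholders that would have to be tuned by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knobs (assumptions, not values from the paper).
 * PREFETCH_DISTANCE: how many elements ahead of the current index to
 * request. Too large risks eviction before use; too small leaves
 * little latency to overlap.
 * ELEMS_PER_LINE: assumes a 64-byte line and 8-byte doubles. */
#define PREFETCH_DISTANCE 64
#define ELEMS_PER_LINE 8

/* Sum n doubles, prefetching ahead roughly once per cache line. */
double sum_with_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Issue at most one prefetch per ELEMS_PER_LINE elements, and
         * skip requests that would fall beyond the end of the array,
         * since that data would never be used. */
        if (i % ELEMS_PER_LINE == 0 && i + PREFETCH_DISTANCE < n) {
            /* GCC/Clang builtin: address, 0 = read, 3 = keep in cache. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        }
        sum += a[i];
    }
    return sum;
}
```

A sequential scan like this is also a pattern that a hardware prefetcher can follow, so the sketch is meant only to show where the prefetch is placed and how the distance and the end-of-array guard interact.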

Automatic hardware prefetch is a new feature in the Pentium 4 processor. It can bring cache lines into the unified

second-level cache based on prior reference patterns.

Pros and Cons of Software and Hardware Prefetching. Software prefetching has the following characteristics:

§ Handles irregular access patterns, which would not trigger the hardware prefetcher

§ Handles prefetching of short arrays and avoids hardware prefetching’s start-up delay before initiating the

fetches

§ Must be added to new code; does not benefit existing applications.

In comparison, hardware prefetching for the Pentium 4 processor has the following characteristics:

§ Works with existing applications

§ Requires regular access patterns

§ Has a start-up penalty before the hardware prefetcher triggers and begins initiating fetches. The penalty has a larger effect for short arrays, where hardware prefetching generates requests for data beyond the end of an array that are never actually used. Software prefetching, by contrast, can recognize and handle these cases by using fetch bandwidth to hide the latency for the initial data in the next array. The penalty diminishes when it is amortized over longer arrays.

§ Avoids instruction and issue port bandwidth overhead.
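
The irregular-access case from the comparison above can be sketched as follows. An indexed gather reads `table[idx[i]]`, so successive addresses follow no regular stride, but the program knows the future indices and can request them in advance. This is a minimal sketch under stated assumptions: `__builtin_prefetch` is a GCC/Clang builtin rather than anything named in this paper, and `gather_sum` and `GATHER_DISTANCE` are hypothetical names with a placeholder value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical look-ahead distance in iterations; tune by measurement. */
#define GATHER_DISTANCE 16

/* Sum table[idx[i]] for i in [0, n). Every idx[i] must be a valid
 * index into table. */
long gather_sum(const long *table, const size_t *idx, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* The index array itself is read sequentially; only the table
         * accesses are irregular, so those are the ones prefetched.
         * The guard keeps idx[] reads in bounds near the end. */
        if (i + GATHER_DISTANCE < n) {
            __builtin_prefetch(&table[idx[i + GATHER_DISTANCE]], 0, 3);
        }
        sum += table[idx[i]];
    }
    return sum;
}
```

Each prefetch here costs an extra load of `idx[i + GATHER_DISTANCE]` plus the prefetch instruction itself, which is the instruction and issue-port overhead that hardware prefetching avoids.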

Loads and Stores

The Pentium 4 processor employs the following techniques to speed up the execution of memory operations:

§ speculative execution of loads

§ reordering of loads with respect to loads and stores

§ multiple outstanding misses

§ buffering of writes

§ forwarding of data from stores to dependent loads.

Performance may be enhanced by not exceeding the memory issue bandwidth and buffer resources provided by the

machine. Up to one load and one store may be issued each cycle from the memory port’s reservation stations. For a memory operation to be dispatched to the reservation stations, a buffer entry must be available for it.

There are 48 load buffers and 24 store buffers. These buffers hold the µop and address information until the

operation is completed, retired, and deallocated.
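
The issue and buffer limits above lend themselves to simple back-of-the-envelope bounds, sketched here in C. The function names are hypothetical, and the model is a deliberate simplification: it assumes each load occupies exactly one load buffer and each store exactly one store buffer, and it ignores every other machine resource.

```c
#include <assert.h>
#include <limits.h>

/* Buffer counts quoted in the text for the Pentium 4 processor. */
enum { LOAD_BUFFERS = 48, STORE_BUFFERS = 24 };

/* Lower bound on cycles per loop iteration from memory issue bandwidth
 * alone: at most one load and one store can issue per cycle, so the
 * larger of the two counts sets the floor. */
unsigned min_memory_issue_cycles(unsigned loads, unsigned stores)
{
    return loads > stores ? loads : stores;
}

/* Rough upper bound on how many whole iterations can hold buffer
 * entries at the same time (assumption: one buffer entry per memory
 * operation). Returns UINT_MAX when the iteration has no memory
 * operations, since the buffers then impose no limit. */
unsigned max_iterations_in_flight(unsigned loads, unsigned stores)
{
    unsigned by_loads = loads ? LOAD_BUFFERS / loads : UINT_MAX;
    unsigned by_stores = stores ? STORE_BUFFERS / stores : UINT_MAX;
    return by_loads < by_stores ? by_loads : by_stores;
}
```

For example, an iteration with 2 loads and 4 stores needs at least 4 cycles to issue its memory operations, and under this model the 24 store buffers cap the number of such iterations in flight at 6.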

The Pentium 4 processor is designed to enable the execution of memory operations out of order with respect to other

instructions and with respect to each other. Loads can be carried out speculatively, that is, before all preceding