A Detailed Look Inside the Intel® NetBurst Micro-Architecture of the Intel Pentium® 4 Processor
b) avoiding the need to access off-chip caches, which can increase the realized bandwidth compared to a
normal load-miss, which returns data to all cache levels.
The situations that are less likely to benefit from software-controlled data prefetch are the following:
§ In cases that are already bandwidth bound, prefetching tends to increase bandwidth demands and is thus not
effective.
§ Prefetching too far ahead may cause eviction of cached data from the caches prior to actually being used in
execution; not prefetching far enough ahead can reduce the ability to overlap memory and execution latencies.
§ The prefetch can be usefully placed only in locations where the likelihood of its being used is low.
Prefetches consume resources in the processor, and the use of too many prefetches can limit their
effectiveness. Examples include prefetching data in a loop for a reference outside the loop, and
prefetching in a basic block that is frequently executed but seldom precedes the reference that the
prefetch targets.
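The trade-off around prefetch distance described above can be sketched in C. This is a minimal illustration, not code from the paper: it assumes a GCC- or Clang-compatible compiler that provides the `__builtin_prefetch` builtin, and the names `sum_with_prefetch`, `distance`, and `DOUBLES_PER_LINE` (with its assumed 64-byte line size) are hypothetical. The right distance depends on memory latency and on the work done per iteration, so it would have to be tuned by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for illustration: 64-byte cache lines, i.e. 8 doubles per line. */
#define DOUBLES_PER_LINE 8

/* Sums n doubles while requesting data `distance` elements ahead.
   `distance` is a hypothetical tuning knob: too large risks eviction
   before use, too small leaves little latency to overlap. */
double sum_with_prefetch(const double *a, size_t n, size_t distance)
{
    assert(a != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Issue one prefetch every DOUBLES_PER_LINE elements, and none
           past the end of the array, so that prefetch resources are
           not spent on data that is never referenced. */
        if (i % DOUBLES_PER_LINE == 0 && i + distance < n)
            __builtin_prefetch(&a[i + distance], 0, 3);
        total += a[i];
    }
    return total;
}
```

The prefetch is only a hint, so the result is the same for any value of `distance`; only the timing changes. A call such as `sum_with_prefetch(data, count, 64)` is one possible starting point for tuning.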
Automatic hardware prefetch is a new feature in the Pentium 4 processor. It can bring cache lines into the unified
second-level cache based on prior reference patterns.
Pros and Cons of Software and Hardware Prefetching. Software prefetching has the following characteristics:
§ Handles irregular access patterns, which would not trigger the hardware prefetcher
§ Handles prefetching of short arrays and avoids hardware prefetching’s start-up delay before initiating the
fetches
§ Must be added to new code; does not benefit existing applications.
In comparison, hardware prefetching for the Pentium 4 processor has the following characteristics:
§ Works with existing applications
§ Requires regular access patterns
§ Has a start-up penalty before the hardware prefetcher triggers and begins initiating fetches. The penalty has a
larger effect for short arrays, where hardware prefetching also generates requests for data beyond the end of
the array that are never used. Software prefetching, by contrast, can recognize and handle these cases by using
fetch bandwidth to hide the latency for the initial data in the next array. The start-up penalty diminishes
when it is amortized over longer arrays.
§ Avoids instruction and issue port bandwidth overhead.
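As an illustration of the first software-prefetching characteristic above, the following C sketch prefetches through an index array. The index stream itself is sequential, but the table accesses it drives are data-dependent, which is the kind of irregular pattern a hardware prefetcher that requires regular accesses would not be expected to follow. This is a sketch under assumptions: `__builtin_prefetch` is a GCC/Clang builtin rather than anything named in this paper, and `gather_sum`, `table`, `idx`, and `distance` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Sums table[idx[k]] for k in [0, n). Because idx[] is known ahead of
   time, software can compute the address needed `distance` iterations
   from now and request it early. */
long gather_sum(const long *table, const size_t *idx, size_t n,
                size_t distance)
{
    assert(n == 0 || (table != NULL && idx != NULL));
    long total = 0;
    for (size_t k = 0; k < n; k++) {
        /* Only prefetch for indices that will actually be referenced. */
        if (k + distance < n)
            __builtin_prefetch(&table[idx[k + distance]], 0, 1);
        total += table[idx[k]];
    }
    return total;
}
```

Each prefetch here costs an extra instruction and an extra load of `idx[k + distance]`, which is the instruction and issue port overhead that hardware prefetching avoids.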
Loads and Stores
The Pentium 4 processor employs the following techniques to speed up the execution of memory operations:
§ speculative execution of loads
§ reordering of loads with respect to loads and stores
§ multiple outstanding misses
§ buffering of writes
§ forwarding of data from stores to dependent loads.
Performance may be enhanced by not exceeding the memory issue bandwidth and buffer resources provided by the
machine. Up to one load and one store may be issued each cycle from the memory port’s reservation stations. For a
memory operation to be dispatched to the reservation stations, a buffer entry must be available for it.
There are 48 load buffers and 24 store buffers. These buffers hold the µop and address information until the
operation is completed, retired, and deallocated.
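The issue and buffer limits above lend themselves to a rough back-of-the-envelope check. The C sketch below is a simplified model, not a description of the actual hardware scheduler: it uses only the figures stated in the text (one load and one store issued per cycle, 48 load buffers, 24 store buffers) and ignores every other machine resource. The function names and the per-iteration counting are assumptions made for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Figures taken from the text. */
enum { LOAD_BUFFERS = 48, STORE_BUFFERS = 24 };

/* Lower bound on cycles needed to issue one loop iteration's memory
   operations, given at most one load and one store per cycle. */
unsigned min_issue_cycles(unsigned loads, unsigned stores)
{
    return loads > stores ? loads : stores;
}

/* Upper bound on how many loop iterations can hold load/store buffer
   entries at the same time, under this simplified model. */
unsigned max_iterations_in_flight(unsigned loads, unsigned stores)
{
    assert(loads > 0 || stores > 0);
    unsigned by_loads  = loads  ? LOAD_BUFFERS  / loads  : UINT_MAX;
    unsigned by_stores = stores ? STORE_BUFFERS / stores : UINT_MAX;
    return by_loads < by_stores ? by_loads : by_stores;
}
```

For a hypothetical loop body with 4 loads and 1 store, this model gives at least 4 issue cycles per iteration and at most 12 iterations with buffers allocated at once, with the load buffers as the limiting resource.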
The Pentium 4 processor is designed to enable the execution of memory operations out of order with respect to other
instructions and with respect to each other. Loads can be carried out speculatively, that is, before all preceding