Intel® IXP42X Product Line of Network Processors and IXC1100 Control Plane Processor
September 2006 DM
Order Number: 252480-006US 185
Intel XScale® Processor—Intel® IXP42X product line and IXC1100 control plane processors
3.10.4.3.2 Memory Page Thrashing
Memory page thrashing occurs because of the nature of SDRAM. SDRAMs are typically
divided into four banks. Each bank can have one selected (open) page, where the page
size for current memory components is typically 4 KB. The lookup time, or latency,
for an access to the currently selected page is two to three bus clocks. Thrashing
occurs when subsequent memory accesses within the same memory bank fall on
different pages. Each memory page change adds three to four bus clock cycles to
memory latency. This added delay extends the required prefetch distance
correspondingly, making it more difficult to hide memory access latencies. This type
of thrashing can be resolved by placing the conflicting data structures in different
memory banks or by interleaving the data structures so that concurrently accessed
data resides within the same memory page. It is also extremely important to ensure
that instruction and data sections are in different memory banks; otherwise they will
continually thrash the memory page selection.
3.10.4.4 Prefetch Considerations
The IXP42X product line and IXC1100 control plane processors have a true prefetch
load instruction (PLD). The purpose of this instruction is to preload data into the data
and mini-data caches. Data prefetching allows hiding of memory transfer latency while
the processor continues to execute instructions. The prefetch is important to compiler
and assembly code because judicious use of the prefetch instruction can enormously
improve throughput performance of the IXP42X product line and IXC1100 control plane
processors. Data prefetch can be applied not only to loops but also to any data
references within a block of code. Prefetch also applies to data writing when the
memory type is enabled as write-allocate.
The IXP42X product line and IXC1100 control plane processors’ prefetch load
instruction is a true prefetch instruction because the load destination is the data or
mini-data cache and not a register. Compilers for processors that have data caches
but do not support prefetch sometimes use a load instruction to preload the data
cache. This technique has the disadvantage of consuming a register for the loaded
data, and of requiring additional registers for subsequent preloads, thus increasing
register pressure. By contrast, the prefetch instruction can be used to reduce register
pressure instead of increasing it.
The prefetch load is a hint instruction and does not guarantee that the data will be
loaded. Whenever the load would cause a fault or require a table walk, the processor
ignores the prefetch instruction and continues with the next instruction; no fault is
taken and no table walk is performed. This is particularly advantageous when a linked
list or recursive data structure is terminated by a NULL pointer: prefetching through
the NULL pointer does not fault or disturb program flow.
3.10.4.4.1 Prefetch Loop Limitations
It is not always advantageous to add prefetch to a loop. Loop characteristics that
limit the usefulness of prefetch are discussed below.
3.10.4.4.2 Compute versus Data Bus Bound
At one extreme, a loop that is data-bus bound will not benefit from prefetch, because
all the system resources for transferring data are quickly allocated and there are no
instructions that can profitably be executed in the meantime. At the other end of the
scale, compute-bound loops allow complete hiding of all data transfer latencies.
3.10.4.4.3 Low Number of Iterations
Loops with very low iteration counts may see the advantages of prefetch completely
negated. A loop with a small, fixed number of iterations may be faster if it is
completely unrolled rather than scheduling prefetch instructions into it.