Developers Manual March, 2003 B-27
Intel® 80200 Processor based on Intel® XScale Microarchitecture
Optimization Guide
B.4.4.2. Prefetch Loop Scheduling
When adding prefetch to a loop that operates on arrays, it may be advantageous to prefetch ahead
one, two, or more iterations. The data for future iterations is located in memory at a fixed offset
from the data for the current iteration, which makes it easy to predict where to fetch the data. The
number of iterations to prefetch ahead is referred to as the prefetch scheduling distance, or psd. For
the Intel® 80200 processor this can be calculated as:
Where:
Npref is the number of cache lines to be prefetched for both reading and writing.
Nevict is the number of cache half-line evictions caused by the loop.
Ninst is the number of instructions executed in one iteration of the loop.
Nhwlinexfer is the number of core clocks required to write half a cache line, as happens when
only one of the cache line dirty bits is set when a line eviction occurs. For the
Intel® 80200 processor this takes 2 bus clocks, or 12 core clocks.
CPI is the average number of core clocks per instruction.
The psd value provided by the above equation is a good starting point, but may not be optimal.
Nevict is very difficult to estimate from static code analysis. However, if the operational data uses
the mini-data cache and the loop's working set overflows the mini-data cache, then a first-order
estimate of Nevict is the number of bytes written per loop iteration divided by the half-cache-line
size of 16 bytes. Cache overflow can be estimated from the number of cache lines transferred each
iteration and the expected number of loop iterations. Nevict and CPI can also be estimated by
profiling the code using the performance monitor “cache write-back” event count.
B.4.4.3. Prefetch Loop Limitations
It is not always advantageous to add prefetch to a loop. Loop characteristics that limit the
usefulness of prefetch are discussed below.
B.4.4.4. Compute vs. Data Bus Bound
At one extreme, a loop that is data bus bound does not benefit from prefetch because all of the
system resources available to transfer data are quickly allocated and there are no instructions that
can profitably be executed in the meantime. At the other end of the scale, compute-bound loops
allow complete hiding of all data transfer latencies.
B.4.4.5. Low Number of Iterations
Loops with very low iteration counts may see the benefit of prefetch completely negated. A loop
with a small, fixed number of iterations may be faster if it is completely unrolled rather than
scheduling prefetch instructions.
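As an illustration of scheduling prefetch ahead of the current iteration, the sketch below sums an array while prefetching a fixed scheduling distance ahead. It assumes psd = 2 and a 32-byte cache line, and uses GCC's `__builtin_prefetch`, which lowers to the ARM PLD instruction on cores that support it and is a no-op otherwise; the function name and constants are illustrative, not from this manual.

```c
#include <stddef.h>

#define PSD        2   /* assumed prefetch scheduling distance */
#define LINE_WORDS 8   /* assumed 32-byte cache line = 8 ints  */

/* Sum an integer array, prefetching the cache line that will be
   needed PSD iterations from now while the current line is summed. */
long sum_with_prefetch(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i += LINE_WORDS) {
        /* Issue a read prefetch (rw = 0) for the line PSD iterations
           ahead, but only if it is still inside the array. */
        if (i + PSD * LINE_WORDS < n)
            __builtin_prefetch(&a[i + PSD * LINE_WORDS], 0, 0);
        for (size_t j = i; j < i + LINE_WORDS && j < n; j++)
            s += a[j];
    }
    return s;
}
```

For a small fixed n (here, a few cache lines), the prefetches near the end of the array are skipped by the bounds check and never pay off, which is the low-iteration-count limitation described above.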
psd = floor( (Nlookup + (Nlinexfer × Npref) + (Nhwlinexfer × Nevict)) / (CPI × Ninst) )