Intel IXC1100, IXP42X 3.10.4 Cache and Prefetch Optimizations

Intel® IXP42X product line and IXC1100 control plane processors—Intel XScale® Processor

Intel® IXP42X Product Line of Network Processors and IXC1100 Control Plane Processor

DM September 2006

Order Number: 252480-006US

3.10.4 Cache and Prefetch Optimizations

This section considers how to use the various cache memories in all their modes and

then examines when and how to use prefetch to improve execution efficiencies.

3.10.4.1 Instruction Cache

The IXP42X product line and IXC1100 control plane processors have separate
instruction and data caches. Only fetched instructions are held in the instruction
cache, even though data and instructions may reside in the same memory space.
Functionally, the instruction cache is either enabled or disabled. There is no
performance benefit in not using the instruction cache. The one exception is code
that locks other code into the instruction cache: that code must itself execute from
non-cached memory.

3.10.4.1.1 Cache Miss Cost

The performance of the IXP42X product line and IXC1100 control plane processors is
highly dependent on reducing the cache miss rate.

The cycle penalty of a cache miss becomes significant when the Intel XScale processor
is running much faster than external memory. In that case, executing non-cached
instructions severely curtails the processor's performance, and it is very important
to do everything possible to minimize cache misses.
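The effect of the miss rate on fetch cost can be estimated with the usual average
access time formula. The following is a minimal sketch, not a figure from this manual:
the function name avg_fetch_cycles and all of its inputs are assumptions for
illustration, and real hit and miss costs must be measured on the target system.

```c
#include <assert.h>

/* Average cycles per instruction fetch, using the standard estimate:
 *   hit_cycles + miss_rate * miss_penalty_cycles
 * All three inputs are illustrative; none of them comes from the manual. */
double avg_fetch_cycles(double hit_cycles, double miss_rate,
                        double miss_penalty_cycles)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_cycles + miss_rate * miss_penalty_cycles;
}
```

With a hypothetical 1-cycle hit and a 40-cycle miss penalty, a 25 percent miss rate
gives 11 cycles per fetch on average. The penalty term grows with the ratio between
core speed and external memory speed, which is why the miss rate matters more as the
core runs faster.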

For the IXP42X product line and IXC1100 control plane processors, take care to
optimize code for the highest possible cache hit rate when it accesses the Expansion
Bus Interface or the PCI Bus Controller. This recommendation stems from the latency
that may be associated with accessing the PCI Bus Controller and the Expansion Bus
Controller: retries are issued to the Intel XScale processor until the requested
transaction completes.

3.10.4.1.2 Round Robin Replacement Cache Policy

Both the data and the instruction caches use a round robin replacement policy to
evict a cache line. The simple consequence is that, in any non-trivial program, every
line will be evicted at some time. The less obvious consequence is that it is very
difficult to predict when evictions take place and which cache lines they affect.
This information must be gained by experimentation using performance profiling.
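The first consequence can be seen in a small host-side simulation of one cache set.
This is a sketch under stated assumptions, not a model of the actual hardware: WAYS
is a deliberately small hypothetical value chosen for readability, and the names
rr_set, rr_access, and hot_line_misses are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define WAYS 4u  /* hypothetical way count, kept small for readability */

struct rr_set {
    uint32_t tag[WAYS];
    int      valid[WAYS];
    unsigned next;          /* round robin pointer: next way to replace */
};

/* Returns 1 on a hit, 0 on a miss. A miss fills the way at the round
 * robin pointer and advances the pointer; hits do not move it. */
int rr_access(struct rr_set *s, uint32_t tag)
{
    for (unsigned w = 0; w < WAYS; w++)
        if (s->valid[w] && s->tag[w] == tag)
            return 1;
    s->tag[s->next] = tag;
    s->valid[s->next] = 1;
    s->next = (s->next + 1) % WAYS;
    return 0;
}

/* Touches one "hot" line before every access to a new "cold" line and
 * counts how often the hot line misses. */
int hot_line_misses(unsigned cold_lines)
{
    struct rr_set s = {0};
    int misses = 0;
    assert(cold_lines > 0);
    for (unsigned i = 0; i < cold_lines; i++) {
        if (!rr_access(&s, 0xA))
            misses++;                 /* hot line was not resident */
        rr_access(&s, 0x100 + i);     /* distinct cold tag each time */
    }
    return misses;
}
```

Because the replacement pointer ignores how recently a line was used, the hot line in
this sketch is evicted again every WAYS cold misses even though it is touched before
each one. A least-recently-used policy would keep it resident in the same access
pattern.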

3.10.4.1.3 Code Placement to Reduce Cache Misses

Code placement can greatly affect cache misses. One way to view the cache is to think
of it as 32 sets of 32 bytes, which span an address range of 1,024 bytes. When
running, the code maps into these 32 blocks modulo 1,024 bytes of cache space. Any
set that is overused will thrash the cache. The ideal situation is for the software
tools to distribute the code evenly over this space with respect to time.

This is very difficult, if not impossible, for a compiler to do. Most of the input
needed to best estimate how to distribute the code will come from profiling followed
by compiler-based two-pass optimizations.
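The mapping described above (32-byte lines, 32 sets, repeating every 1,024 bytes) can
be sketched as a host-side helper that shows which sets a piece of code occupies. The
function names icache_set and add_code_range are hypothetical, and the address ranges
passed in would come from a linker map or profiler output, which this sketch does not
read.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 32u   /* bytes per cache line, from the text above */
#define NUM_SETS   32u   /* number of sets, from the text above */

/* Set selected by an address: line number modulo the number of sets,
 * so addresses 1,024 bytes apart select the same set. */
unsigned icache_set(uint32_t addr)
{
    return (addr / LINE_BYTES) % NUM_SETS;
}

/* Adds one to hist[set] for every cache line touched by the code in
 * [start, start + size). Assumes start + size does not wrap around. */
void add_code_range(unsigned hist[NUM_SETS], uint32_t start, uint32_t size)
{
    assert(size > 0);
    uint32_t first = start / LINE_BYTES;
    uint32_t last  = (start + size - 1) / LINE_BYTES;
    for (uint32_t line = first; line <= last; line++)
        hist[line % NUM_SETS]++;
}
```

Feeding the ranges of the hottest functions into add_code_range gives a per-set count.
Sets with counts far above the rest are candidates for thrashing, and moving or
padding one of the functions that lands there is one way to spread the load. The
count alone says nothing about timing, so it complements profiling rather than
replacing it.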

3.10.4.1.4 Locking Code into the Instruction Cache

One very important instruction cache feature is the ability to lock code into the
instruction cache. Once locked, the code is always available for fast execution.
Locking critical code is also worthwhile because, under the round robin replacement
policy, even a very frequently executed function will eventually be evicted.
Key code components to consider for locking are: