Intel® IXP42X product line and IXC1100 control plane processors—Intel XScale® Processor
Intel® IXP42X Product Line of Network Processors and IXC1100 Control Plane Processor
DM September 2006
Order Number: 252480-006US
3.10.4 Cache and Prefetch Optimizations
This section considers how to use the various cache memories in all their modes and
then examines when and how to use prefetch to improve execution efficiencies.

3.10.4.1 Instruction Cache

The IXP42X product line and IXC1100 control plane processors have separate
instruction and data caches. Only fetched instructions are held in the instruction cache,
even though data and instructions may reside in the same memory space.
Functionally, the instruction cache is either enabled or disabled. There is no
performance benefit in not using the instruction cache. The one exception is that code
that locks other code into the instruction cache must itself execute from non-cached
memory.
3.10.4.1.1 Cache Miss Cost
The IXP42X product line and IXC1100 control plane processors’ performance is highly
dependent on reducing the cache miss rate.
Note that the cycle penalty of a cache miss becomes significant when the Intel XScale
processor is running much faster than external memory. Executing non-cached
instructions severely curtails the processor's performance in this case, so it is very
important to do everything possible to minimize cache misses.
For the IXP42X product line and IXC1100 control plane processors, take care to
optimize code for the highest possible cache hit rate when accesses are made to
the Expansion Bus Interface or the PCI Bus Controller. This recommendation
stems from the latency that may be associated with accessing the PCI Bus Controller
and Expansion Bus Controller. Retries are issued to the Intel XScale processor until
the requested transaction is completed.
3.10.4.1.2 Round Robin Replacement Cache Policy
Both the data and the instruction caches use a round robin replacement policy to evict
a cache line. The simple consequence is that, assuming a non-trivial program, every
line will be evicted at some time. The less obvious consequence is that it is very
difficult to predict when evictions take place and which cache lines they affect.
This information must be gained by experimentation using performance
profiling.
3.10.4.1.3 Code Placement to Reduce Cache Misses
Code placement can greatly affect cache misses. One way to view the cache is to think
of it as 32 sets of 32 bytes, which span an address range of 1,024 bytes. When
running, the code maps onto these 32 sets by address modulo 1,024. Any sets that
are overused will thrash the cache. The ideal situation is for the software tools to
distribute the code so that code executed close together in time is spread evenly
over this space.
This is very difficult, if not impossible, for a compiler to do. Most of the input needed to
best estimate how to distribute the code will come from profiling, followed by
compiler-based two-pass optimizations.
3.10.4.1.4 Locking Code into the Instruction Cache
One very important instruction cache feature is the ability to lock code into the
instruction cache. Once locked into the instruction cache, the code is always available
for fast execution. Another reason for locking critical code into cache is that, under the
round robin replacement policy, unlocked code is eventually evicted, even if it is a very
frequently executed function. Key code components to consider for locking are: