Intel® IXP42X Product Line of Network Processors and IXC1100 Control Plane Processor
September 2006 DM
Order Number: 252480-006US 191
Intel XScale® Processor—Intel® IXP42X product line and IXC1100 control plane processors
3.10.4.4.11 Prefetch to Reduce Register Pressure
Prefetch can be used to reduce register pressure. When data is needed for an
operation, the load is normally scheduled far enough in advance to hide the load
latency. However, the load ties up the receiving register until the data can be used.
For example:

    ldr r2, [r0]
    ; Process code { not yet cached latency > 60 core clocks }
    add r1, r1, r2

In the above case, r2 is unavailable for other processing until the add instruction
has executed. Prefetching the data frees the register for use in the meantime. The
example code becomes:

    pld [r0] ; prefetch the data, keeping r2 available for use
    ; Process code
    ldr r2, [r0]
    ; Process code { ldr result latency is 3 core clocks }
    add r1, r1, r2

With the added prefetch, register r2 can be used for other operations until just
before it is needed.

3.10.5 Instruction Scheduling
This section discusses instruction scheduling optimizations. Instruction scheduling
refers to the rearrangement of a sequence of instructions for the purpose of minimizing
pipeline stalls. Reducing the number of pipeline stalls improves application
performance. While making this rearrangement, care should be taken to ensure that
the rearranged sequence of instructions has the same effect as the original sequence of
instructions.

3.10.5.1 Scheduling Loads

On the IXP42X product line and IXC1100 control plane processors, an LDR instruction
has a result latency of three cycles, assuming the data being loaded is in the data
cache. If the instruction after the LDR needs to use the result of the load, it stalls
for two cycles. If possible, the instructions surrounding the LDR instruction should be
rearranged to avoid this stall. Consider the following example:

    add r1, r2, r3
    ldr r0, [r5]
    add r6, r0, r1
    sub r8, r2, r3
    mul r9, r2, r3

In the code shown above, the ADD instruction following the LDR would stall for two
cycles because it uses the result of the load. The code can be rearranged as follows to
prevent the stalls:
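
One possible reordering of the five-instruction sequence (add, ldr, add, sub, mul)
that avoids the load-use stall is sketched here. It assumes the load hits the data
cache, so the LDR result latency is three cycles, and that nothing outside this
fragment depends on the original instruction order. The ordering is an illustration
worked out from the dependencies in the example, not a quotation of a prescribed
sequence.

```asm
ldr r0, [r5]      ; issue the load first
add r1, r2, r3    ; independent of the load result
sub r8, r2, r3    ; independent of the load result
add r6, r0, r1    ; r0 is ready by now, so no stall is expected (assumes cache hit)
mul r9, r2, r3
```

Only instructions that do not depend on each other were moved, so every register
receives the same value as in the original order. The two instructions placed between
the LDR and the ADD that consumes r0 cover the two cycles the ADD would otherwise
spend stalled.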