Intel® IXP42X Product Line of Network Processors and IXC1100 Control Plane Processor
September 2006 DM
Order Number: 252480-006US
Intel XScale® Processor—Intel® IXP42X product line and IXC1100 control plane processors
3.10.5.1.2 Scheduling Load and Store Multiple (LDM/STM)
LDM and STM instructions have an issue latency of 2-20 cycles, depending on the
number of registers being loaded or stored. Assuming a data cache hit, the issue
latency is typically two cycles plus one additional cycle for each register being
loaded or stored. The instruction following an LDM stalls whether or not it depends
on the results of the load. An LDRD or STRD instruction does not suffer from this
drawback (except when followed by a memory operation) and should be used where
possible.
Consider the task of adding two 64-bit integer values. Assume that the addresses of
these values are aligned on an 8-byte boundary. This can be achieved using LDM
instructions as shown below:
    ; r0 contains the address of the first 64-bit value
    ; r1 contains the address of the second 64-bit value
    ldm r0, {r2, r3}
    ldm r1, {r4, r5}
    adds r0, r2, r4
    adc r1, r3, r5
If the code were written as shown above, and all the accesses hit the cache, it
would take 11 cycles to complete. Rewriting the code with LDRD instructions, as
shown below, takes only seven cycles:
    ; r0 contains the address of the first 64-bit value
    ; r1 contains the address of the second 64-bit value
    ldrd r2, [r0]
    ldrd r4, [r1]
    adds r0, r2, r4
    adc r1, r3, r5
Performance would improve further if other instructions could be placed after the
LDRD instructions to hide the stalls caused by their result latencies.
Similarly, the code sequence shown below takes five cycles to complete:
    stm r0, {r2, r3}
    add r1, r1, #1
The alternative version shown below takes only three cycles to complete:
    strd r2, [r0]
    add r1, r1, #1
3.10.5.2 Scheduling Data Processing Instructions
Most data processing instructions on the IXP42X product line and IXC1100 control
plane processors have a result latency of one cycle. This means that the current
instruction can use the result of the previous data processing instruction without
stalling. However, the result latency is two cycles if the current instruction uses
the result of the previous data processing instruction as the operand of a shift by
immediate. As a result, the following code segment incurs a one-cycle stall for the
MOV instruction:
    sub r6, r7, r8
    add r1, r2, r3
    mov r4, r1, LSL #2
Moving the independent SUB instruction between the ADD and the MOV removes the
one-cycle stall:
    add r1, r2, r3
    sub r6, r7, r8
    mov r4, r1, LSL #2
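The shift-by-immediate rule from section 3.10.5.2 can be illustrated with a small stall counter. This is only a sketch under simplifying assumptions that are not taken from the manual: a strictly in-order, single-issue pipeline, every instruction issuing in one cycle, and no loads, stores, or multiplies. The names Insn, count_stalls, RESULT_LATENCY, and SHIFT_RESULT_LATENCY are hypothetical and exist only for this example. The two latency values are the ones stated in the text: one cycle for a plain operand, and two cycles when the result is consumed as the operand of a shift by immediate.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Latencies stated in section 3.10.5.2 (in cycles).
RESULT_LATENCY = 1        # result used as a plain operand
SHIFT_RESULT_LATENCY = 2  # result used as the operand of a shift by immediate


@dataclass(frozen=True)
class Insn:
    """Hypothetical model of one data processing instruction."""
    text: str                      # assembly text, for display only
    dest: str                      # destination register
    sources: Tuple[str, ...] = ()  # registers read as plain operands
    shifted: Optional[str] = None  # register read through a shift by immediate


def count_stalls(program):
    """Count stall cycles, assuming in-order single-issue execution."""
    issued_at = {}  # register -> cycle in which its producer issued
    cycle = 0       # earliest cycle the next instruction could issue
    stalls = 0
    for insn in program:
        earliest = cycle
        # Plain operands are usable one cycle after the producer issues.
        for reg in insn.sources:
            if reg in issued_at:
                earliest = max(earliest, issued_at[reg] + RESULT_LATENCY)
        # A shifted operand needs one more cycle than a plain operand.
        if insn.shifted in issued_at:
            earliest = max(earliest,
                           issued_at[insn.shifted] + SHIFT_RESULT_LATENCY)
        stalls += earliest - cycle
        issued_at[insn.dest] = earliest
        cycle = earliest + 1
    return stalls


# SUB first: the MOV follows the ADD directly and must wait for r1.
original = [
    Insn("sub r6, r7, r8", dest="r6", sources=("r7", "r8")),
    Insn("add r1, r2, r3", dest="r1", sources=("r2", "r3")),
    Insn("mov r4, r1, LSL #2", dest="r4", shifted="r1"),
]

# ADD first: the independent SUB fills the slot between ADD and MOV.
rearranged = [original[1], original[0], original[2]]

print(count_stalls(original), count_stalls(rearranged))
```

Under these assumptions the model reports one stall cycle for the sequence that places the MOV directly after the ADD, and none when the independent SUB is scheduled between them. It does not attempt to reproduce the LDM, STM, LDRD, or STRD cycle counts quoted earlier, which depend on issue latencies that this sketch does not model.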