Intel IXC1100, IXP42X 3.10.5.2 Scheduling Data Processing Instructions

Intel® IXP42X Product Line of Network Processors and IXC1100 Control Plane Processor

September 2006 DM

Order Number: 252480-006US 195

3.10.5.1.2 Scheduling Load and Store Multiple (LDM/STM)

LDM and STM instructions have an issue latency of 2-20 cycles, depending on the number of registers being loaded or stored. Assuming a data cache hit, the issue latency is typically two cycles plus one additional cycle for each register loaded or stored. The instruction following an LDM stalls whether or not it depends on the results of the load. An LDRD or STRD instruction does not suffer from this drawback (except when followed by a memory operation) and should be used where possible.

Consider the task of adding two 64-bit integer values whose addresses are aligned on an 8-byte boundary. This can be done with two LDM instructions followed by an ADDS/ADC pair.

Assuming all accesses hit the cache, the LDM version of this code takes 11 cycles to complete. Rewriting it to use LDRD in place of LDM takes only seven cycles. Performance improves further if other instructions can be scheduled after each LDRD to reduce the stalls caused by the result latency of the LDRD instructions.
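To make the arithmetic of the 64-bit add concrete, the following minimal Python sketch models what the ADDS/ADC pair computes on 32-bit words. It is an illustration only and not part of the manual: the name `add64_words` is hypothetical, the low word of each value is assumed to be the one loaded into r2 and r4, and the model says nothing about cycle counts.

```python
MASK32 = 0xFFFFFFFF  # width of one ARM register


def add64_words(a_lo, a_hi, b_lo, b_hi):
    """Return (sum_lo, sum_hi) as an ADDS/ADC pair would produce them.

    Hypothetical helper for illustration; arguments are 32-bit words.
    """
    total_lo = a_lo + b_lo
    sum_lo = total_lo & MASK32               # ADDS: low words, wraps at 32 bits
    carry = total_lo >> 32                   # carry flag left by ADDS (0 or 1)
    sum_hi = (a_hi + b_hi + carry) & MASK32  # ADC: high words plus carry
    return sum_lo, sum_hi


# 0x00000001_FFFFFFFF + 0x00000002_00000001 = 0x00000004_00000000
lo, hi = add64_words(0xFFFFFFFF, 0x1, 0x1, 0x2)
print(hex(hi), hex(lo))  # 0x4 0x0
```

The carry out of the low-word addition is the only link between the two instructions, which is why ADC must come after ADDS but both loads can be scheduled freely ahead of them.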

Similarly, an STM of two registers followed by an ADD takes five cycles to complete. The alternative version, which uses STRD in place of STM, takes only three cycles.
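The typical issue-latency rule given at the start of this section (two cycles plus one cycle per register, assuming a data cache hit) can be written as a one-line model. The sketch below is an assumption-laden illustration, not a timing reference: `ldm_stm_issue_latency` is a hypothetical name, and the model covers only the "typical" rule, so it does not reproduce the upper end of the 2-20 cycle range quoted in the text.

```python
def ldm_stm_issue_latency(num_registers):
    """Typical LDM/STM issue latency in cycles, assuming a data cache hit.

    Hypothetical helper: encodes only the 'two cycles plus one per
    register' rule stated in the text.
    """
    if not 1 <= num_registers <= 16:
        raise ValueError("an ARM register list holds 1 to 16 registers")
    return 2 + num_registers


stm_cycles = ldm_stm_issue_latency(2)  # stm r0, {r2, r3}
print(stm_cycles)      # 4
print(stm_cycles + 1)  # 5, adding one cycle for the following ADD
```

For a two-register STM the rule gives four cycles, which together with one cycle for the following ADD is consistent with the five-cycle figure quoted for the STM sequence.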

3.10.5.2 Scheduling Data Processing Instructions

Most data processing instructions on the IXP42X product line and IXC1100 control plane processors have a result latency of one cycle. This means that the current instruction can use the result of the previous data processing instruction without stalling. However, the result latency is two cycles if the current instruction uses the result of the previous data processing instruction as the operand of a shift by immediate. As a result, the following code segment would incur a one-cycle stall for the MOV instruction:

The code above can be rearranged as follows to remove the one-cycle stall:

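64-bit add using LDM (11 cycles):

    ; r0 contains the address of the first 64-bit value
    ; r1 contains the address of the second 64-bit value
    ldm   r0, {r2, r3}
    ldm   r1, {r4, r5}
    adds  r0, r2, r4
    adc   r1, r3, r5

64-bit add using LDRD (seven cycles):

    ; r0 contains the address of the first 64-bit value
    ; r1 contains the address of the second 64-bit value
    ldrd  r2, [r0]
    ldrd  r4, [r1]
    adds  r0, r2, r4
    adc   r1, r3, r5

Store using STM, followed by an ADD (five cycles):

    stm   r0, {r2, r3}
    add   r1, r1, #1

Store using STRD, followed by an ADD (three cycles):

    strd  r2, [r0]
    add   r1, r1, #1

Shift by immediate that incurs a one-cycle stall on the MOV:

    sub   r6, r7, r8
    add   r1, r2, r3
    mov   r4, r1, LSL #2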
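The shift-by-immediate rule from Section 3.10.5.2 can be checked with a toy model. The Python sketch below is a simplification made for illustration, not the processor's actual pipeline behavior: `Insn` and `count_shift_stalls` are hypothetical names, only the one rule from the text is modeled (a one-cycle stall when an instruction shifts by an immediate the register written by the immediately preceding instruction), and the reordered sequence is one plausible rearrangement, assumed here rather than quoted from the manual.

```python
from typing import NamedTuple, Optional, Sequence


class Insn(NamedTuple):
    """Hypothetical record for one data processing instruction."""
    text: str
    dest: str
    shifted_src: Optional[str] = None  # register shifted by an immediate, if any


def count_shift_stalls(program: Sequence[Insn]) -> int:
    """Count one-cycle stalls caused by the shift-by-immediate rule only."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        # Stall when the shifted operand was produced one instruction earlier.
        if cur.shifted_src is not None and cur.shifted_src == prev.dest:
            stalls += 1
    return stalls


ORIGINAL = [
    Insn("sub r6, r7, r8", dest="r6"),
    Insn("add r1, r2, r3", dest="r1"),
    Insn("mov r4, r1, LSL #2", dest="r4", shifted_src="r1"),
]
# Assumed reordering: move the independent SUB between the ADD and the MOV.
REARRANGED = [ORIGINAL[1], ORIGINAL[0], ORIGINAL[2]]

print(count_shift_stalls(ORIGINAL))    # 1
print(count_shift_stalls(REARRANGED))  # 0
```

The reordering is legal in this case because the SUB neither reads nor writes r1 or r4, so it can fill the extra cycle of result latency without changing the outcome.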