Intel 80200, Processor B.5 Instruction Scheduling

Developer’s Manual March, 2003 B-35

Intel® 80200 Processor based on Intel® XScale™ Microarchitecture

Optimization Guide

B.5 Instruction Scheduling

This section discusses instruction scheduling optimizations. Instruction scheduling is the rearrangement of a sequence of instructions to minimize pipeline stalls; reducing the number of pipeline stalls improves application performance. When rearranging instructions, take care that the rearranged sequence has the same effect as the original sequence.

B.5.1 Scheduling Loads

On the Intel® 80200 processor, an LDR instruction has a result latency of 3 cycles, assuming the data being loaded is in the data cache. If the instruction immediately after the LDR uses the result of the load, it stalls for 2 cycles. Where possible, the instructions surrounding the LDR instruction should be rearranged to avoid this stall. Consider the following example:

add r1, r2, r3

ldr r0, [r5]

add r6, r0, r1

sub r8, r2, r3

mul r9, r2, r3

In the code shown above, the ADD instruction following the LDR would stall for 2 cycles because

it uses the result of the load. The code can be rearranged as follows to prevent the stalls:

ldr r0, [r5]

add r1, r2, r3

sub r8, r2, r3

add r6, r0, r1

mul r9, r2, r3
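The stall counts quoted above can be reproduced with a small timing model. The following Python sketch is only an illustration under simplifying assumptions that are not taken from the manual: the pipeline is treated as single-issue with one issue cycle per instruction, a loaded register becomes readable 3 cycles after its LDR issues, and every other result (including that of MUL, whose result is unused here) is readable by the next instruction. The names `count_stalls`, `original` and `rearranged` are hypothetical.

```python
# Simplified single-issue timing model; an illustration, not the real pipeline.
LDR_LATENCY = 3  # result latency of LDR on a data cache hit (from the text)


def count_stalls(program):
    """Return the total stall cycles for a list of (opcode, dest, sources)."""
    ready = {}   # register -> first cycle in which its value can be read
    cycle = 0    # cycle in which the next instruction would like to issue
    stalls = 0
    for opcode, dest, sources in program:
        # An instruction issues once all of its source registers are ready.
        issue = max([ready.get(reg, 0) for reg in sources] + [cycle])
        stalls += issue - cycle
        # Assumption: only LDR has a multi-cycle result latency in this model.
        latency = LDR_LATENCY if opcode == "ldr" else 1
        ready[dest] = issue + latency
        cycle = issue + 1
    return stalls


original = [
    ("add", "r1", ["r2", "r3"]),
    ("ldr", "r0", ["r5"]),
    ("add", "r6", ["r0", "r1"]),   # uses r0 right after the load
    ("sub", "r8", ["r2", "r3"]),
    ("mul", "r9", ["r2", "r3"]),
]

rearranged = [
    ("ldr", "r0", ["r5"]),
    ("add", "r1", ["r2", "r3"]),
    ("sub", "r8", ["r2", "r3"]),
    ("add", "r6", ["r0", "r1"]),   # two independent instructions hide the latency
    ("mul", "r9", ["r2", "r3"]),
]

print(count_stalls(original), count_stalls(rearranged))
```

Under these assumptions the model reports 2 stall cycles for the original order and none for the rearranged order, matching the description above.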

Note that this rearrangement may not always be possible. Consider the following example:

cmp r1, #0

addne r4, r5, #4

subeq r4, r5, #4

ldr r0, [r4]

cmp r0, #10

In the example above, the LDR instruction cannot be moved before the ADDNE or the SUBEQ instructions because the LDR instruction depends on the result of these instructions (the address in r4). The code can, however, be rewritten to run faster at the expense of increased code size:

cmp r1, #0

ldrne r0, [r5, #4]

ldreq r0, [r5, #-4]

addne r4, r5, #4

subeq r4, r5, #4

cmp r0, #10

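The optimized code takes six cycles to execute compared to the seven cycles taken by the unoptimized version. At one issue cycle per instruction, the five instructions of the unoptimized version incur a 2-cycle stall at the final CMP, which immediately follows the load. In the optimized version, the ADDNE and SUBEQ instructions sit between the loads and the final CMP, so its six instructions issue without a stall.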
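The rewrite is valid only if it leaves r0, r4 and the outcome of the final CMP unchanged for both outcomes of the first CMP. The Python sketch below models each listing instruction by instruction and compares the results. It is a minimal check under stated assumptions: only the Z flag is modelled, memory is a dictionary keyed by address, and the function names and test addresses are hypothetical.

```python
def original(r1, r5, mem):
    """Model of the unoptimized sequence; returns (r0, r4, Z)."""
    z = (r1 == 0)          # cmp   r1, #0
    r4 = None
    if not z:
        r4 = r5 + 4        # addne r4, r5, #4
    if z:
        r4 = r5 - 4        # subeq r4, r5, #4
    r0 = mem[r4]           # ldr   r0, [r4]
    z = (r0 == 10)         # cmp   r0, #10
    return r0, r4, z


def rewritten(r1, r5, mem):
    """Model of the optimized sequence; returns (r0, r4, Z)."""
    z = (r1 == 0)          # cmp   r1, #0
    r0 = None
    r4 = None
    if not z:
        r0 = mem[r5 + 4]   # ldrne r0, [r5, #4]   (r5 is not written back)
    if z:
        r0 = mem[r5 - 4]   # ldreq r0, [r5, #-4]
    if not z:
        r4 = r5 + 4        # addne r4, r5, #4
    if z:
        r4 = r5 - 4        # subeq r4, r5, #4
    z = (r0 == 10)         # cmp   r0, #10
    return r0, r4, z


# Hypothetical memory contents around a base address of 0x100.
mem = {0x0FC: 3, 0x104: 10}

for r1 in (0, 1, 7):
    assert original(r1, 0x100, mem) == rewritten(r1, 0x100, mem)
```

The check passes because the conditional loads do not alter the flags set by the first CMP, so the ADDNE and SUBEQ instructions that follow them still see the same condition.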

The result latency for an LDR instruction is significantly higher if the data being loaded is not in the data cache. To minimize the number of pipeline stalls in such a situation, the LDR instruction should be moved as far away as possible from the instruction that uses the result of the load. Note that this increases register pressure and may at times cause certain register values to be spilled to memory. In such cases, use a preload instruction or a preload hint to ensure that the data access in the LDR instruction hits the cache when it executes. A preload hint should be used when it is not certain that the load instruction will be executed; a preload instruction should be used when it is certain that the load instruction will be executed. Consider the following code sample: