Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

The unrolled loop consists of 10 instructions. Based on the decode/retire bandwidth of three instructions per cycle, this loop goes no faster than three iterations in 10 cycles (which is equivalent to 6/10 of a floating-point add per cycle because there are two additions per iteration), or 1.4 times as fast as the original loop.

$$
\frac{3\ \text{instructions}}{\text{cycle}} \times \frac{1\ \text{iteration}}{10\ \text{instructions}} \times \frac{2\ \text{FADDs}}{\text{iteration}} = \frac{6\ \text{FADDs}}{10\ \text{cycles}} = 0.600\ \frac{\text{FADDs}}{\text{cycle}}
$$

 

Deriving the Loop Control for Partially Unrolled Loops

A frequently used loop construct is a counting loop. In a typical case, the loop count starts at some lower bound (low), increases by some fixed, positive increment (inc) for each iteration of the loop, and may not exceed some upper bound (high):

for (k = low; k <= high; k += inc) {
   x[k] = ...
}

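The following code shows how to partially unroll such a loop by an unroll factor (factor) and how to derive the loop control for the partially unrolled version of the loop. The unrolled loop may run only while the last index it touches in one pass, k + (factor - 1) * inc, does not exceed high, which gives the bound high - (factor - 1) * inc: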

for (k = low; k <= (high - (factor - 1) * inc); k += factor * inc) {
   // Begin the series of unrolled statements.
   x[k + 0 * inc] = ...
   // Continue the series if the unrolling factor is greater than 2.
   x[k + 1 * inc] = ...
   x[k + 2 * inc] = ...
   ...
   // End the series.
   x[k + (factor - 1) * inc] = ...
}

// Handle the end cases.
for (k = k; k <= high; k += inc) {
   x[k] = ...
}

Related Information

For information on loop unrolling at the C-source level, see “Unrolling Small Loops” on page 13.


Scheduling Optimizations

Chapter 7
