148 Scheduling Optimizations Chapter 7
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
The unrolled loop consists of 10 instructions. Given the decode/retire bandwidth of three
instructions per cycle, this loop can execute no faster than three iterations in 10 cycles. Because each
iteration performs two additions, that is equivalent to 6/10 of a floating-point add per cycle, or
1.4 times as fast as the original loop.
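The same bound can be computed for other loop shapes. The following is a minimal sketch, not code from this guide: the helper name fadds_per_cycle is hypothetical, and it assumes decode/retire bandwidth is the only limiting factor.

```c
/* Upper bound on floating-point adds per cycle when decode/retire
   bandwidth is the only limit. Hypothetical helper for illustration. */
static double fadds_per_cycle(double instr_per_cycle,
                              double instr_per_iter,
                              double fadds_per_iter)
{
    /* iterations per cycle = instr_per_cycle / instr_per_iter */
    return instr_per_cycle / instr_per_iter * fadds_per_iter;
}
```

For the unrolled loop, fadds_per_cycle(3, 10, 2) gives 0.6. Assuming the original loop has seven instructions and one addition per iteration (an assumption consistent with the 1.4 ratio stated above), fadds_per_cycle(3, 7, 1) gives about 0.43, and the ratio of the two is 1.4.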
Deriving the Loop Control for Partially Unrolled Loops
A frequently used loop construct is a counting loop. In a typical case, the loop count starts at some
lower bound (low), increases by some fixed, positive increment (inc) for each iteration of the loop,
and may not exceed some upper bound (high):
for (k = low; k <= high; k += inc) {
   x[k] = ...
}
The following code shows how to partially unroll such a loop by an unroll factor (factor) and how to
derive the loop control for the partially unrolled version of the loop:
for (k = low; k <= (high - (factor - 1) * inc); k += factor * inc) {
   // Begin the series of unrolled statements.
   x[k + 0 * inc] = ...
   // Continue the series if the unrolling factor is greater than 2.
   x[k + 1 * inc] = ...
   x[k + 2 * inc] = ...
   ...
   // End the series.
   x[k + (factor - 1) * inc] = ...
}
// Handle the end cases; k carries over from the unrolled loop.
for (; k <= high; k += inc) {
   x[k] = ...
}
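As a concrete instance of the template above, the following sketch fixes the unroll factor at 4 and checks that the unrolled loop writes the same elements as the simple counting loop. The function names, the array size N, and the loop body (x[k] = 2 * k) are assumptions chosen only for illustration.

```c
#include <assert.h>

#define N 32   /* illustrative array size; callers must keep high < N */

/* Reference version: the original counting loop. */
static void fill_simple(int *x, int low, int high, int inc)
{
    for (int k = low; k <= high; k += inc) {
        x[k] = 2 * k;
    }
}

/* The same loop, partially unrolled by a factor of 4. */
static void fill_unrolled4(int *x, int low, int high, int inc)
{
    int k;

    assert(inc > 0);   /* the derivation assumes a positive increment */

    /* Run while a full group of 4 elements still fits below high. */
    for (k = low; k <= high - 3 * inc; k += 4 * inc) {
        x[k + 0 * inc] = 2 * (k + 0 * inc);
        x[k + 1 * inc] = 2 * (k + 1 * inc);
        x[k + 2 * inc] = 2 * (k + 2 * inc);
        x[k + 3 * inc] = 2 * (k + 3 * inc);
    }
    /* Handle the end cases: at most 3 leftover iterations. */
    for (; k <= high; k += inc) {
        x[k] = 2 * k;
    }
}

/* Returns 1 if both versions leave identical array contents. */
static int unroll_matches(int low, int high, int inc)
{
    int a[N] = {0}, b[N] = {0};

    fill_simple(a, low, high, inc);
    fill_unrolled4(b, low, high, inc);
    for (int i = 0; i < N; i++) {
        if (a[i] != b[i]) {
            return 0;
        }
    }
    return 1;
}
```

For example, with low = 0, high = 20, and inc = 3, the unrolled loop handles k = 0, 3, 6, 9 in one pass and the cleanup loop handles k = 12, 15, 18, so unroll_matches(0, 20, 3) returns 1. The sketch uses signed int so that high - 3 * inc can go negative safely when the range is shorter than one unrolled group.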
Related Information
For information on loop unrolling at the C-source level, see “Unrolling Small Loops” on page 13.
\[
\frac{3\ \text{instructions}}{\text{cycle}} \times \frac{1\ \text{iteration}}{10\ \text{instructions}} \times \frac{2\ \text{FADDs}}{\text{iteration}} = \frac{6\ \text{FADDs}}{10\ \text{cycles}} = 0.600\ \frac{\text{FADDs}}{\text{cycle}}
\]