Software Optimization Guide for AMD64 Processors | 25112 Rev. 3.06 September 2005 |
Complete Loop Unrolling
Complete loop unrolling eliminates the loop overhead completely by replacing the loop with copies of the loop body.
Because complete loop unrolling removes the loop counter, it also reduces register pressure. However, completely unrolling very large loops can result in the inefficient use of the L1 instruction cache.
Example: Complete Loop Unrolling
In the following C code, the number of loop iterations is known at compile time and the loop body is less than 100 instructions:
#define ARRAY_LENGTH 3
int sum, i, a[ARRAY_LENGTH];
...
sum = 0;
for (i = 0; i < ARRAY_LENGTH; i++) {
    sum = sum + a[i];
}
To completely unroll an n-iteration loop, remove the loop control and replicate the loop body n times:
sum = 0;
sum = sum + a[0];
sum = sum + a[1];
sum = sum + a[2];
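The two forms above can be compared side by side. The following is a minimal sketch, not code from this guide: the function names sum_rolled and sum_unrolled are hypothetical and exist only to show that both forms compute the same result.

```c
#define ARRAY_LENGTH 3

/* Rolled form (hypothetical helper): the counter i is incremented,
   compared, and branched on in every iteration. */
int sum_rolled(const int a[ARRAY_LENGTH])
{
    int sum, i;

    sum = 0;
    for (i = 0; i < ARRAY_LENGTH; i++) {
        sum = sum + a[i];
    }
    return sum;
}

/* Completely unrolled form (hypothetical helper): the loop body is
   copied once per iteration, so no counter or branch remains. */
int sum_unrolled(const int a[ARRAY_LENGTH])
{
    int sum;

    sum = 0;
    sum = sum + a[0];
    sum = sum + a[1];
    sum = sum + a[2];
    return sum;
}
```

Both functions return the same value for any three-element input. Only the rolled form needs a register for i.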
Partial Loop Unrolling
Partial loop unrolling reduces the loop overhead by duplicating the loop body several times, changing the increment in the loop, and adding cleanup code to execute any leftover iterations of the loop. The number of times the loop body is duplicated is known as the unroll factor. However, partial loop unrolling may increase register pressure.
Example: Partial Loop Unrolling
In the following C code, each element of one array is added to the corresponding element of another array:
double a[MAX_LENGTH], b[MAX_LENGTH];
for (i = 0; i < MAX_LENGTH; i++) {
    a[i] = a[i] + b[i];
}
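One way to apply the three steps described above (duplicate the body, change the increment, add cleanup code) is sketched below. It assumes an unroll factor of 4, chosen only for illustration. The function name add_arrays_unroll4 and the length parameter n are hypothetical, introduced so that the cleanup loop can be exercised with lengths that are not multiples of 4.

```c
#include <stddef.h>

/* Hypothetical helper: adds b[i] to a[i] for n elements,
   partially unrolled with an assumed unroll factor of 4. */
void add_arrays_unroll4(double *a, const double *b, size_t n)
{
    size_t i;

    /* Main loop: four copies of the body per pass, and the
       increment is changed from 1 to 4. */
    for (i = 0; i + 3 < n; i += 4) {
        a[i]     = a[i]     + b[i];
        a[i + 1] = a[i + 1] + b[i + 1];
        a[i + 2] = a[i + 2] + b[i + 2];
        a[i + 3] = a[i + 3] + b[i + 3];
    }

    /* Cleanup code: runs the 0 to 3 leftover iterations when n
       is not a multiple of the unroll factor. */
    for (; i < n; i++) {
        a[i] = a[i] + b[i];
    }
}
```

With n = 7, for example, the main loop makes one pass over elements 0 through 3 and the cleanup loop handles elements 4 through 6. The loop counter is then updated and tested twice in the main loop instead of seven times, which is the overhead reduction that partial unrolling aims for.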