Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

Complete Loop Unrolling

Complete loop unrolling eliminates loop overhead entirely by replacing the loop with one copy of the loop body for each iteration.

Because complete loop unrolling removes the loop counter, it also reduces register pressure. However, completely unrolling very large loops can result in the inefficient use of the L1 instruction cache.

Example: Complete Loop Unrolling

In the following C code, the number of loop iterations is known at compile time and the loop body is less than 100 instructions:

#define ARRAY_LENGTH 3

int sum, i, a[ARRAY_LENGTH];

...

sum = 0;
for (i = 0; i < ARRAY_LENGTH; i++) {
    sum = sum + a[i];
}

To completely unroll an n-iteration loop, remove the loop control and replicate the loop body n times:

sum = 0;
sum = sum + a[0];
sum = sum + a[1];
sum = sum + a[2];
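As a minimal sketch, the two forms above can be wrapped in functions and compiled side by side to confirm that they compute the same result. The function names sum_rolled and sum_unrolled are hypothetical and do not come from this guide; the #error guard is an added assumption that keeps the hand-unrolled copy from silently going stale if ARRAY_LENGTH changes.

```c
#define ARRAY_LENGTH 3

#if ARRAY_LENGTH != 3
#error "sum_unrolled is written for exactly 3 elements"
#endif

/* Rolled form: the counter i is incremented, compared, and branched on
   in every iteration. */
int sum_rolled(const int a[ARRAY_LENGTH])
{
    int sum = 0;
    for (int i = 0; i < ARRAY_LENGTH; i++) {
        sum = sum + a[i];
    }
    return sum;
}

/* Completely unrolled form (hypothetical helper): no counter and no
   branch, one copy of the loop body per iteration. */
int sum_unrolled(const int a[ARRAY_LENGTH])
{
    int sum = 0;
    sum = sum + a[0];
    sum = sum + a[1];
    sum = sum + a[2];
    return sum;
}
```

Both functions return the same value for any input array; only the loop control differs.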

Partial Loop Unrolling

Partial loop unrolling reduces the loop overhead by duplicating the loop body several times, changing the increment in the loop, and adding cleanup code to execute any leftover iterations of the loop. The number of times the loop body is duplicated is known as the unroll factor.

However, partial loop unrolling may increase register pressure, because the duplicated loop bodies can keep more values live at the same time.

Example: Partial Loop Unrolling

In the following C code, each element of one array is added to the corresponding element of another array:

double a[MAX_LENGTH], b[MAX_LENGTH];

for (i = 0; i < MAX_LENGTH; i++) {
    a[i] = a[i] + b[i];
}
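The three steps described earlier (duplicate the body, change the increment, add cleanup code) might be applied to this loop as in the following sketch. It assumes an unroll factor of 4, chosen only for illustration, and wraps the loop in a hypothetical function, add_arrays_unrolled4, that takes the element count n as a parameter instead of using MAX_LENGTH directly.

```c
#include <stddef.h>

/* Hypothetical helper: adds b[i] to a[i] for n elements.
   The unroll factor of 4 is an assumption made for this sketch. */
void add_arrays_unrolled4(double *a, const double *b, size_t n)
{
    size_t i = 0;

    /* Main loop: four copies of the loop body per iteration, with the
       increment changed from 1 to 4. It runs only while four full
       elements remain. */
    for (; i + 3 < n; i += 4) {
        a[i]     = a[i]     + b[i];
        a[i + 1] = a[i + 1] + b[i + 1];
        a[i + 2] = a[i + 2] + b[i + 2];
        a[i + 3] = a[i + 3] + b[i + 3];
    }

    /* Cleanup loop: executes the 0 to 3 leftover iterations when n is
       not a multiple of the unroll factor. */
    for (; i < n; i++) {
        a[i] = a[i] + b[i];
    }
}
```

With this shape, the counter update, compare, and branch in the main loop execute once per four element additions instead of once per addition. The cleanup loop keeps the result correct for any n.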


Scheduling Optimizations

Chapter 7
