AMD Athlon™ Processor x86 Code Optimization

22007E/0 — November 1999

unrolling reduces register pressure by removing the loop counter. To completely unroll a loop, remove the loop control and replicate the loop body N times. In addition, completely unrolling a loop increases scheduling opportunities.

Only the unrolling of very large code loops results in inefficient use of the L1 instruction cache. A loop can be unrolled completely if all of the following conditions are true:

The loop is in a frequently executed piece of code.

The loop count is known at compile time.

The loop body, once unrolled, is less than 100 instructions, which is approximately 400 bytes of code.
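As a minimal sketch of the transformation described above, the following C fragment shows a loop with a small, compile-time-known trip count next to its completely unrolled form. The names VEC_LEN, add_rolled, and add_unrolled are hypothetical and chosen only for illustration; they do not come from this guide.

```c
#include <stddef.h>

/* Hypothetical trip count, known at compile time. */
#define VEC_LEN 4

/* Rolled form: the counter is incremented, compared, and branched on
   in every iteration. */
void add_rolled(double *a, const double *b)
{
    size_t i;
    for (i = 0; i < VEC_LEN; i++) {
        a[i] = a[i] + b[i];
    }
}

/* Completely unrolled form: the loop control is removed and the body
   is replicated VEC_LEN times, so no register holds a loop counter. */
void add_unrolled(double *a, const double *b)
{
    a[0] = a[0] + b[0];
    a[1] = a[1] + b[1];
    a[2] = a[2] + b[2];
    a[3] = a[3] + b[3];
}
```

Both functions compute the same result. In the unrolled version the four statements are independent of one another, which is what gives the compiler or assembly programmer the additional scheduling freedom.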

Partial Loop Unrolling

Partial loop unrolling can increase register pressure, which can make it inefficient given the small number of registers in the x86 architecture. In certain situations, however, the performance gains outweigh that cost. Consider partial loop unrolling if the following conditions are met:

Spare registers are available

Loop body is small, so that loop overhead is significant

Number of loop iterations is likely > 10
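To make the trade-off in these conditions concrete, here is a small C sketch of a loop unrolled by a factor of two. It is an illustration under stated assumptions, not code from this guide: the function name dot_unrolled2 and the use of two accumulators are hypothetical choices. The second accumulator is the kind of extra register demand that partial unrolling can create, and the trailing statement handles an odd iteration count.

```c
#include <stddef.h>

/* Hypothetical example: dot product with the loop unrolled by two. */
double dot_unrolled2(const double *x, const double *y, size_t n)
{
    double sum0 = 0.0;  /* accumulator for even-indexed elements */
    double sum1 = 0.0;  /* accumulator for odd-indexed elements; needs a spare register */
    size_t i = 0;

    /* Two elements per pass: half as many counter updates and branches. */
    for (; i + 1 < n; i += 2) {
        sum0 += x[i] * y[i];
        sum1 += x[i + 1] * y[i + 1];
    }

    /* Remainder: one element is left over when n is odd. */
    if (i < n) {
        sum0 += x[i] * y[i];
    }

    return sum0 + sum1;
}
```

Because the two partial sums are combined only at the end, the floating-point additions happen in a different order than in the rolled loop, so the result can differ in the last bits for inputs that are not exactly representable.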

Consider the following piece of C code:

double a[MAX_LENGTH], b[MAX_LENGTH];

for (i = 0; i < MAX_LENGTH; i++) {
    a[i] = a[i] + b[i];
}

Without loop unrolling, the code looks like the following:
