36 C and C++ Source-Level Optimizations Chapter 2
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
Listing 11. Preferred
double a[100], sum1, sum2, sum3, sum4, sum;
int i;
sum1 = 0.0;
sum2 = 0.0;
sum3 = 0.0;
sum4 = 0.0;
for (i = 0; i < 100; i + 4) {
sum1 += a[i];
sum2 += a[i+1];
sum3 += a[i+2];
sum4 += a[i+3];
}
sum = (sum4 + sum3) + (sum1 + sum2);
Notice that the four-way unrolling is chosen to exploit the four-stage fully pipelined floating-point
adder. Each stage of the floating-point adder is occupied on every clock cycle, ensuring maximum
sustained utilization.