Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

Listing 11. Preferred

double a[100], sum1, sum2, sum3, sum4, sum;
int i;

sum1 = 0.0;
sum2 = 0.0;
sum3 = 0.0;
sum4 = 0.0;

for (i = 0; i < 100; i += 4) {
   sum1 += a[i];
   sum2 += a[i+1];
   sum3 += a[i+2];
   sum4 += a[i+3];
}

sum = (sum4 + sum3) + (sum1 + sum2);

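Notice that the four-way unrolling is chosen to exploit the four-stage fully pipelined floating-point adder. Because the four partial sums do not depend on one another, a new addition can enter the pipeline on every clock cycle instead of waiting for the previous addition to complete. Each stage of the floating-point adder is therefore occupied on every clock cycle, ensuring maximum sustained utilization.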
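The same technique can be sketched for an array whose length is not known to be a multiple of four. The function name sum_unrolled4 and the cleanup loop for leftover elements are assumptions made for illustration and are not part of the listing above. Note that splitting the sum into four partial sums changes the order of the additions, so the rounded result may differ slightly from that of a strictly sequential sum.

```c
#include <stddef.h>

/* Hypothetical helper (not from the guide): sums n doubles using four
   independent partial sums, as in the preferred listing. */
double sum_unrolled4(const double *a, size_t n)
{
    double sum1 = 0.0, sum2 = 0.0, sum3 = 0.0, sum4 = 0.0;
    double sum;
    size_t i = 0;

    /* Main loop: four additions per iteration, none of which depends
       on another, so they can overlap in a pipelined adder. */
    for (; i + 4 <= n; i += 4) {
        sum1 += a[i];
        sum2 += a[i + 1];
        sum3 += a[i + 2];
        sum4 += a[i + 3];
    }

    /* Combine the partial sums pairwise. */
    sum = (sum4 + sum3) + (sum1 + sum2);

    /* Cleanup loop (an assumption): handles the 0 to 3 elements left
       over when n is not a multiple of four. */
    for (; i < n; i++) {
        sum += a[i];
    }

    return sum;
}
```

For the 100-element array in the listing, the cleanup loop executes zero times and the function performs the same additions as the preferred code.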


C and C++ Source-Level Optimizations

Chapter 2
