25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

2.16 Explicit Parallelism in Code

Optimization

Where possible, break long dependency chains into several independent dependency chains that can then be executed in parallel, exploiting the execution units in each pipeline.

Application

This optimization applies to:

32-bit software

64-bit software

Rationale and Examples

Breaking long dependency chains into smaller, independently executing units is especially important in floating-point code, whether it is mapped to x87, SSE, or SSE2 instructions, because of the longer latency of floating-point operations. Because most languages (including ANSI C) guarantee that floating-point expressions are not reordered, compilers usually cannot perform such optimizations unless they offer a switch that allows noncompliant reordering of floating-point expressions according to algebraic rules.

Reordered code that is algebraically identical to the original code does not necessarily produce identical computational results due to the lack of associativity of floating-point operations. There are well-known numerical considerations in applying these optimizations (consult a book on numerical analysis). In some cases, these optimizations may lead to unexpected results. In the vast majority of cases, the final result differs only in the least-significant bits.
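The lack of associativity can be seen in a minimal sketch. The helper names sum_left and sum_right are hypothetical and are not part of this guide; the sketch assumes IEEE 754 double-precision arithmetic with the default round-to-nearest mode.

```c
/* Hypothetical helpers (not from the guide) that add the same three
   values with two different groupings. The volatile temporaries force
   each intermediate result to be rounded to double precision. */

/* (a + b) + c: the left-to-right grouping of the source expression. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* a + (b + c): algebraically identical, but rounded differently. */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

Under these assumptions, calling both helpers with (0.1, 0.2, 0.3) gives results that differ only in the least-significant bit, the common case. Calling them with (1e16, -1e16, 1.0) gives 1.0 from sum_left and 0.0 from sum_right, an example of the less common case in which regrouping changes the result substantially.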

Listing 10. Avoid

double a[100], sum;
int i;

sum = 0.0f;

for (i = 0; i < 100; i++) {
   sum += a[i];
}
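In the loop above, every addition depends on the result of the previous one, so the additions form a single serial chain. One way to split that chain is sketched below. The function name sum4 and the choice of four accumulators are assumptions made for illustration; the best number of chains depends on the latency and throughput of the target processor, and the result can differ from the serial sum in the least-significant bits.

```c
#include <stddef.h>

/* Hypothetical helper (not from the guide): sums n doubles using four
   independent accumulators, so that consecutive additions do not all
   depend on the result of the previous addition. */
double sum4(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: four independent dependency chains. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Remainder, for n that is not a multiple of 4. */
    for (; i < n; i++) {
        s0 += a[i];
    }

    /* Combine the partial sums. This grouping differs from the
       strictly left-to-right order of the serial loop. */
    return (s0 + s1) + (s2 + s3);
}
```

For the 100-element array of Listing 10, the call would be sum = sum4(a, 100).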

Chapter 2

C and C++ Source-Level Optimizations
