AMD Athlon™ Processor x86 Code Optimization

22007E/0 — November 1999

Completely Unroll Small Loops

Take advantage of the AMD Athlon processor’s large, 64-Kbyte instruction cache and completely unroll small loops. Unrolling loops can be beneficial to performance, especially if the loop body is small, because the loop overhead is then a significant share of the work. Many compilers do not unroll loops aggressively. For loops that have a small fixed loop count and a small loop body, completely unrolling the loops at the source level is recommended.

Example 1 (Avoid):

//3D-transform: multiply vector V by 4x4 transform matrix M
for (i=0; i<4; i++) {
   r[i] = 0;
   for (j=0; j<4; j++) {
      r[i] += M[j][i]*V[j];
   }
}

Example 2 (Preferred):

//3D-transform: multiply vector V by 4x4 transform matrix M
r[0] = M[0][0]*V[0] + M[1][0]*V[1] + M[2][0]*V[2] + M[3][0]*V[3];
r[1] = M[0][1]*V[0] + M[1][1]*V[1] + M[2][1]*V[2] + M[3][1]*V[3];
r[2] = M[0][2]*V[0] + M[1][2]*V[1] + M[2][2]*V[2] + M[3][2]*V[3];
r[3] = M[0][3]*V[0] + M[1][3]*V[1] + M[2][3]*V[2] + M[3][3]*V[3];

Avoid Unnecessary Store-to-Load Dependencies

A store-to-load dependency exists when data is stored to memory, only to be read back shortly thereafter. See “Store-to-Load Forwarding Restrictions” on page 51 for more details. The AMD Athlon processor contains hardware to accelerate such store-to-load dependencies, allowing the load to obtain the store data before it has been written to memory. However, it is still faster to avoid such dependencies altogether and keep the data in an internal register.
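As an illustration of keeping the data in a register, the following sketch computes a running sum over an array in two ways. The function names, the double element type, and the parameter n are assumptions made for this illustration and do not come from the text above. The first function reloads x[k-1] on every iteration, immediately after the previous iteration stored it. The second carries the running value in a local variable, which a compiler can normally keep in a register, so each iteration only stores.

```c
#include <assert.h>
#include <stddef.h>

/* Form to avoid: every iteration loads x[k-1], the element that the
   previous iteration has just stored, creating a store-to-load
   dependency on each pass through the loop. */
void running_sum_via_memory(double *x, const double *y, size_t n)
{
    assert(n >= 1);              /* x[0] is the starting value */
    for (size_t k = 1; k < n; k++) {
        x[k] = x[k - 1] + y[k];
    }
}

/* Preferred form: the running value is carried in the local variable t,
   so the loop stores each result but never reads it back from memory. */
void running_sum_via_register(double *x, const double *y, size_t n)
{
    assert(n >= 1);              /* x[0] is read before the loop */
    double t = x[0];
    for (size_t k = 1; k < n; k++) {
        t = t + y[k];
        x[k] = t;
    }
}
```

Both functions leave the same values in x when x and y do not overlap; only the second avoids reading back data it has just written.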

Avoiding store-to-load dependencies is especially important if they are part of a long dependency chain, as might occur in a recurrence computation. If the dependency occurs while operating on arrays, many compilers are unable to optimize the
