Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

fstp  QWORD PTR [eax+edx*8+ARR_SIZE+16]   ; a[i+2] = b[i+2] * c[i+2]
fld   QWORD PTR [ebx+edx*8+ARR_SIZE+24]   ; b[i+3]
fmul  QWORD PTR [ecx+edx*8+ARR_SIZE+24]   ; b[i+3] * c[i+3]
fstp  QWORD PTR [eax+edx*8+ARR_SIZE+24]   ; a[i+3] = b[i+3] * c[i+3]
fld   QWORD PTR [ebx+edx*8+ARR_SIZE+32]   ; b[i+4]
fmul  QWORD PTR [ecx+edx*8+ARR_SIZE+32]   ; b[i+4] * c[i+4]
fstp  QWORD PTR [eax+edx*8+ARR_SIZE+32]   ; a[i+4] = b[i+4] * c[i+4]
fld   QWORD PTR [ebx+edx*8+ARR_SIZE+40]   ; b[i+5]
fmul  QWORD PTR [ecx+edx*8+ARR_SIZE+40]   ; b[i+5] * c[i+5]
fstp  QWORD PTR [eax+edx*8+ARR_SIZE+40]   ; a[i+5] = b[i+5] * c[i+5]
fld   QWORD PTR [ebx+edx*8+ARR_SIZE+48]   ; b[i+6]
fmul  QWORD PTR [ecx+edx*8+ARR_SIZE+48]   ; b[i+6] * c[i+6]
fstp  QWORD PTR [eax+edx*8+ARR_SIZE+48]   ; a[i+6] = b[i+6] * c[i+6]
fld   QWORD PTR [ebx+edx*8+ARR_SIZE+56]   ; b[i+7]
fmul  QWORD PTR [ecx+edx*8+ARR_SIZE+56]   ; b[i+7] * c[i+7]
fstp  QWORD PTR [eax+edx*8+ARR_SIZE+56]   ; a[i+7] = b[i+7] * c[i+7]
add   edx, 8                              ; Compute next 8 products
jnz   loop                                ; until none left.
END

The following optimization rules are applied to this example:

• Partially unroll loops to ensure that the data stride per loop iteration is equal to the length of a cache line. This avoids overlapping PREFETCH instructions and thus makes optimal use of the available number of outstanding prefetches.

• Because the array array_a is written rather than read, use PREFETCHW instead of PREFETCH to avoid overhead for switching cache lines to the correct state. The prefetch distance is optimized such that each loop iteration is working on three cache lines while active prefetches bring in the next cache lines.

• Reduce index arithmetic to a minimum by use of complex addressing modes and biasing of the array base addresses in order to cut down on loop overhead.
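The rules above can be sketched in C as a rough illustration. This is not the guide's code: the helper name mul_arrays, the constants PER_LINE and AHEAD, the 64-byte cache line, and the four-line prefetch distance are all assumptions, and __builtin_prefetch is a GCC/Clang builtin that a compiler may or may not lower to PREFETCH/PREFETCHW on a given target.

```c
#include <assert.h>
#include <stddef.h>

enum {
    PER_LINE = 8,            /* doubles per cache line, assuming 64-byte lines */
    AHEAD    = 4 * PER_LINE  /* assumed prefetch distance: 4 cache lines */
};

/* Hypothetical helper: a[i] = b[i] * c[i] for i in [0, n).
   n must be a multiple of PER_LINE. */
static void mul_arrays(double *a, const double *b, const double *c, size_t n)
{
    assert(n % PER_LINE == 0);

    /* Bias the base pointers to the array ends so that a single index
       counts up from -n to 0 and the loop test is a compare with zero. */
    double *ae = a + n;
    const double *be = b + n;
    const double *ce = c + n;

    for (ptrdiff_t i = -(ptrdiff_t)n; i != 0; i += PER_LINE) {
        /* One prefetch per array per iteration: the unrolled body covers
           exactly one cache line, so prefetches never overlap. The guard
           only keeps the C address arithmetic inside the arrays. */
        if (i + AHEAD < 0) {
            __builtin_prefetch(&be[i + AHEAD], 0, 3);  /* read  */
            __builtin_prefetch(&ce[i + AHEAD], 0, 3);  /* read  */
            __builtin_prefetch(&ae[i + AHEAD], 1, 3);  /* write intent */
        }

        /* Loop body unrolled eight times (one cache line of doubles). */
        ae[i + 0] = be[i + 0] * ce[i + 0];
        ae[i + 1] = be[i + 1] * ce[i + 1];
        ae[i + 2] = be[i + 2] * ce[i + 2];
        ae[i + 3] = be[i + 3] * ce[i + 3];
        ae[i + 4] = be[i + 4] * ce[i + 4];
        ae[i + 5] = be[i + 5] * ce[i + 5];
        ae[i + 6] = be[i + 6] * ce[i + 6];
        ae[i + 7] = be[i + 7] * ce[i + 7];
    }
}
```

The bounds guard adds a branch that the assembly version does not have; it is kept here only so the sketch stays within defined C pointer arithmetic.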

Determining Prefetch Distance

When determining how far ahead to prefetch, the basic guideline is to initiate the prefetch early enough that the data is in the cache by the time it is needed, under the constraint that there cannot be more than eight prefetches in flight at any given time.

To determine the optimal prefetch distance, use empirical benchmarking when possible. Prefetching three or four cache lines ahead (192 or 256 bytes) is a good starting point and usually gives good results. Trying to prefetch too far ahead impairs performance.
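As a starting point for such benchmarking, the sketch below converts a distance in cache lines to array elements and times one candidate distance on a simple summing kernel. Everything named here (distance_in_elems, sum_prefetched, time_distance, LINE_BYTES) is hypothetical, the 64-byte line size is an assumption consistent with the 192- and 256-byte figures above, and __builtin_prefetch is a GCC/Clang builtin rather than anything the guide prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

enum { LINE_BYTES = 64 };   /* assumed cache-line size in bytes */

/* Convert a distance in cache lines to a distance in array elements.
   With 64-byte lines: 3 lines = 192 bytes = 24 doubles,
                       4 lines = 256 bytes = 32 doubles. */
static size_t distance_in_elems(size_t lines, size_t elem_size)
{
    assert(elem_size != 0);
    return lines * LINE_BYTES / elem_size;
}

/* Hypothetical test kernel: sum x[0..n) while prefetching `lines`
   cache lines ahead of the element currently being read. */
static double sum_prefetched(const double *x, size_t n, size_t lines)
{
    const size_t per_line = LINE_BYTES / sizeof(double);
    const size_t ahead = distance_in_elems(lines, sizeof(double));
    double s = 0.0;

    for (size_t i = 0; i < n; i++) {
        /* Issue one prefetch per cache line, kept inside the array. */
        if (i % per_line == 0 && i + ahead < n)
            __builtin_prefetch(&x[i + ahead], 0, 3);
        s += x[i];
    }
    return s;
}

/* Time one candidate distance; returns CPU seconds measured by clock()
   and stores the computed sum so the work cannot be optimized away. */
static double time_distance(const double *x, size_t n, size_t lines,
                            double *sum_out)
{
    clock_t t0 = clock();
    *sum_out = sum_prefetched(x, n, lines);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

A benchmark would call time_distance for several values of lines (for example 1 through 8) on an array much larger than the cache, repeat each measurement a few times, and keep the distance with the lowest time.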

Memory-Limited versus Processor-Limited Code

Software prefetching can help to hide memory latency, but it cannot increase the total memory bandwidth. Many loops are limited by memory bandwidth rather than processor speed, as shown in Figure 4. In these cases, the best that software prefetching can do is to ensure that enough memory

Cache and Memory Optimizations

Chapter 5
