25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

For more information on write-combining, see Appendix B, “Implementation of Write-Combining.”

Multiple Prefetches

Programmers can initiate multiple outstanding prefetches on the AMD Athlon 64 and AMD Opteron processors. These processors can theoretically have up to eight prefetches outstanding, but in practice the number is usually smaller. When all resources are occupied by memory read requests, the processor waits until resources become free before processing the next request. Because multiple prefetch requests are essentially handled in order, issue prefetches in the order in which the data is needed.
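As a rough illustration of keeping several prefetches in flight, the following C sketch issues one prefetch per array for each cache line of work. It is not taken from this guide: it assumes a GCC- or Clang-compatible compiler (for the `__builtin_prefetch` hint), a 64-byte cache line, and a prefetch distance of 256 bytes (four lines). The function name `multiply_with_prefetch` and both constants are hypothetical, and the best distance should be found by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for illustration only, not values mandated by the guide. */
#define CACHE_LINE_BYTES     64   /* assumed cache-line size        */
#define PREFETCH_AHEAD_BYTES 256  /* assumed distance: four lines   */

/* Hypothetical helper: a[i] = b[i] * c[i] with three prefetch streams. */
static void multiply_with_prefetch(double *a, const double *b,
                                   const double *c, size_t n)
{
    const size_t per_line = CACHE_LINE_BYTES / sizeof(double);
    const size_t ahead    = PREFETCH_AHEAD_BYTES / sizeof(double);

    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < n; i += per_line) {
        /* One hint per array per cache line; skipped near the end so
           that no address past the arrays is ever formed. */
        if (i + ahead < n) {
            __builtin_prefetch(&a[i + ahead], 1, 3); /* will be written */
            __builtin_prefetch(&b[i + ahead], 0, 3); /* will be read    */
            __builtin_prefetch(&c[i + ahead], 0, 3); /* will be read    */
        }

        /* Process the elements of the current cache line. */
        size_t end = (i + per_line < n) ? i + per_line : n;
        for (size_t j = i; j < end; j++) {
            a[j] = b[j] * c[j];
        }
    }
}
```

The prefetches are only hints, so the results are the same with or without them. Issuing one hint per line rather than one per element avoids spending prefetch resources on lines that have already been requested.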

The following example shows how to initiate multiple prefetches when traversing more than one array.

Example—Multiple Prefetches

.CODE
.K3D
.686

; Original C code:
;
; #define LARGE_NUM 65536
; #define ARR_SIZE (LARGE_NUM*8)
;
; double array_a[LARGE_NUM];
; double array_b[LARGE_NUM];
; double array_c[LARGE_NUM];
; int i;
;
; for (i = 0; i < LARGE_NUM; i++) {
;    array_a[i] = array_b[i] * array_c[i];
; }

   mov   edx, (-LARGE_NUM)                   ; Use biased index.
   mov   eax, OFFSET array_a                 ; Get address of array_a.
   mov   ebx, OFFSET array_b                 ; Get address of array_b.
   mov   ecx, OFFSET array_c                 ; Get address of array_c.

loop:
   prefetchw [eax+256]                       ; Four cache lines ahead
   prefetch  [ebx+256]                       ; Four cache lines ahead
   prefetch  [ecx+256]                       ; Four cache lines ahead

   fld   QWORD PTR [ebx+edx*8+ARR_SIZE]      ; b[i]
   fmul  QWORD PTR [ecx+edx*8+ARR_SIZE]      ; b[i] * c[i]
   fstp  QWORD PTR [eax+edx*8+ARR_SIZE]      ; a[i] = b[i] * c[i]

   fld   QWORD PTR [ebx+edx*8+ARR_SIZE+8]    ; b[i+1]
   fmul  QWORD PTR [ecx+edx*8+ARR_SIZE+8]    ; b[i+1] * c[i+1]
   fstp  QWORD PTR [eax+edx*8+ARR_SIZE+8]    ; a[i+1] = b[i+1] * c[i+1]

   fld   QWORD PTR [ebx+edx*8+ARR_SIZE+16]   ; b[i+2]
   fmul  QWORD PTR [ecx+edx*8+ARR_SIZE+16]   ; b[i+2] * c[i+2]

Chapter 5

Cache and Memory Optimizations
