25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Avoid an assembly-language equivalent like this, which uses base and displacement components (for example, [esi+a]) to compute array-element addresses, requiring additional pointer arithmetic to increment the offsets into the forward-traversed arrays:

          mov  ecx, MAXSIZE     ; Initialize loop counter.
          xor  esi, esi         ; Initialize offset into array a.
          xor  edi, edi         ; Initialize offset into array b.
          xor  ebx, ebx         ; Initialize offset into array c.
add_loop:
          mov  eax, [esi+a]     ; Get element from a.
          mov  edx, [edi+b]     ; Get element from b.
          add  eax, edx         ; a[i] + b[i]
          mov  [ebx+c], eax     ; Write result to c.
          add  esi, 4           ; Increment offset into a.
          add  edi, 4           ; Increment offset into b.
          add  ebx, 4           ; Increment offset into c.
          dec  ecx              ; Decrement loop count
          jnz  add_loop         ; until loop count is 0.

Instead, traverse the arrays in a downward direction (from higher to lower addresses), in order to take advantage of scaled-index addressing (for example, [ecx*4+a]), which minimizes pointer arithmetic within the loop:

          mov  ecx, MAXSIZE - 1   ; Initialize index.
add_loop:
          mov  eax, [ecx*4+a]     ; Get element from a.
          mov  edx, [ecx*4+b]     ; Get element from b.
          add  eax, edx           ; a[i] + b[i]
          mov  [ecx*4+c], eax     ; Write result to c.
          dec  ecx                ; Decrement index
          jns  add_loop           ; until index is negative.
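For comparison, the downward traversal can be sketched in C as follows. This is an illustrative sketch only: the function name add_arrays_down and its parameters are hypothetical and do not appear in this guide, and whether a particular compiler emits scaled-index addressing for it depends on the compiler and its settings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the guide): c[i] = a[i] + b[i],
   visiting elements from index n-1 down to 0. */
void add_arrays_down(const int *a, const int *b, int *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);

    /* One signed index drives all three accesses, mirroring the
       single ecx register in the assembly loop. The loop ends when
       the index becomes negative, like the dec/jns pair. */
    for (ptrdiff_t i = (ptrdiff_t)n - 1; i >= 0; i--) {
        c[i] = a[i] + b[i];
    }
}
```

Calling add_arrays_down(a, b, c, MAXSIZE) on three arrays of MAXSIZE elements produces the same sums as the forward loop, provided no iteration depends on the result of another.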

A change in the direction of traversal is possible only if each loop iteration is completely independent of the others. If you cannot change the direction of traversal for a given array, you can still minimize pointer arithmetic: use a displacement that points to the byte just past the end of the array as the base address, and use an index that starts at a negative value and reaches zero when the loop terminates:

          mov  ecx, (-MAXSIZE)              ; Initialize index.
add_loop:
          mov  eax, [ecx*4+a+MAXSIZE*4]     ; Get element from a.
          mov  edx, [ecx*4+b+MAXSIZE*4]     ; Get element from b.
          add  eax, edx                     ; a[i] + b[i]
          mov  [ecx*4+c+MAXSIZE*4], eax     ; Write result to c.
          inc  ecx                          ; Increment index
          jnz  add_loop                     ; until index is 0.
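The negative-index technique can also be expressed in C. The following is a minimal sketch under stated assumptions: the name add_arrays_neg_index and the *_end pointers are hypothetical, introduced here only for illustration. Each pointer is set one element past the end of its array, and a single index runs from -n up to zero, so the elements are still visited from lower to higher addresses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the guide): c[i] = a[i] + b[i],
   visiting elements in forward order with a negative index. */
void add_arrays_neg_index(const int *a, const int *b, int *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);

    /* Base addresses that point one element past the end of each
       array, playing the role of a+MAXSIZE*4 and so on. */
    const int *a_end = a + n;
    const int *b_end = b + n;
    int *c_end = c + n;

    /* The index starts at -n and counts up; the loop stops when it
       reaches zero, like the inc/jnz pair in the assembly loop. */
    for (ptrdiff_t i = -(ptrdiff_t)n; i != 0; i++) {
        c_end[i] = a_end[i] + b_end[i];
    }
}
```

Because the traversal order is unchanged, a loop whose iterations depend on earlier ones still behaves as before. For example, with int x[4] = {1, 0, 0, 0} and int ones[3] = {1, 1, 1}, the call add_arrays_neg_index(x, ones, x + 1, 3) leaves x holding 1, 2, 3, 4, because each iteration reads the value stored by the previous one.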

Chapter 7

Scheduling Optimizations
