25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Without loop unrolling, this is the equivalent assembly-language code:

        mov   ecx, MAX_LENGTH     ;Initialize counter.
        mov   eax, OFFSET a       ;Load address of array a into EAX.
        mov   ebx, OFFSET b       ;Load address of array b into EBX.
add_loop:
        fld   QWORD PTR [eax]     ;Push object pointed to by EAX onto the FP stack.
        fadd  QWORD PTR [ebx]     ;Add object pointed to by EBX to ST(0).
        fstp  QWORD PTR [eax]     ;Copy ST(0) to object pointed to by EAX; pop ST(0).
        add   eax, 8              ;Point to next element of array a.
        add   ebx, 8              ;Point to next element of array b.
        dec   ecx                 ;Decrement counter.
        jnz   add_loop            ;If elements remain, then jump.
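For comparison, the rolled assembly loop corresponds roughly to the C function below. This is a minimal sketch, not code from the guide: the function name `add_arrays_rolled` and the parameter `n` (standing in for MAX_LENGTH) are hypothetical, and the element type is assumed to be `double` because the listing uses QWORD PTR operands with x87 instructions.

```c
#include <stddef.h>

/* Hypothetical helper (name and signature are assumptions):
   adds b[i] into a[i], one element per loop iteration. */
void add_arrays_rolled(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Corresponds to: fld [eax]; fadd [ebx]; fstp [eax] */
        a[i] = a[i] + b[i];
    }
}
```

Unlike the assembly listing, which tests its counter only at the bottom of the loop, this sketch also handles `n == 0` by doing nothing.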

The rolled loop consists of seven instructions. AMD Athlon 64 and AMD Opteron processors can decode and retire as many as three instructions per cycle, so the loop cannot execute faster than three iterations in seven cycles (3/7 of a floating-point add per cycle). However, the pipelined floating-point adder allows one add every cycle, so the rolled loop leaves more than half of the adder's capacity unused.

$$
\frac{3\ \text{instructions}}{\text{cycle}} \times \frac{1\ \text{iteration}}{7\ \text{instructions}} \times \frac{1\ \text{FADD}}{\text{iteration}} = \frac{3\ \text{FADDs}}{7\ \text{cycles}} = 0.429\ \frac{\text{FADDs}}{\text{cycle}}
$$
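The arithmetic behind this bound can be checked with a few lines of C. This is a sketch under stated assumptions: the helper name `fadds_per_cycle_bound` and its parameters are hypothetical, and the function models only the decode/retire limit described in the text, not the separate limit of one add per cycle imposed by the adder itself.

```c
/* Hypothetical helper (not from the guide): upper bound on FADDs per
   cycle when the only limit is how many instructions the processor can
   decode and retire per cycle. */
double fadds_per_cycle_bound(int instr_per_cycle,
                             int instr_per_iter,
                             int fadds_per_iter)
{
    /* (instructions/cycle) * (iterations/instruction) * (FADDs/iteration) */
    return (double)instr_per_cycle * (double)fadds_per_iter
           / (double)instr_per_iter;
}
```

With the values from the text, `fadds_per_cycle_bound(3, 7, 1)` evaluates to 3/7, about 0.429. Counting the ten instructions in the body of the unrolled loop, which performs two FADDs per iteration, the same formula gives 6/10 = 0.6.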

 

 

 

 

After partial loop unrolling using an unroll factor of two, the new code creates a potential end case that must be handled outside the loop:

        mov   ecx, MAX_LENGTH     ;Initialize counter.
        mov   eax, OFFSET a       ;Load address of array a into EAX.
        mov   ebx, OFFSET b       ;Load address of array b into EBX.
        shr   ecx, 1              ;Divide counter by 2 (the unroll factor).
        jnc   add_loop            ;If original counter was even, then jump.

        ;Handle the end case.
        fld   QWORD PTR [eax]     ;Push object pointed to by EAX onto the FP stack.
        fadd  QWORD PTR [ebx]     ;Add object pointed to by EBX to ST(0).
        fstp  QWORD PTR [eax]     ;Copy ST(0) to object pointed to by EAX; pop ST(0).
        add   eax, 8              ;Point to next element of array a.
        add   ebx, 8              ;Point to next element of array b.

add_loop:
        fld   QWORD PTR [eax]     ;Push object pointed to by EAX onto the FP stack.
        fadd  QWORD PTR [ebx]     ;Add object pointed to by EBX to ST(0).
        fstp  QWORD PTR [eax]     ;Copy ST(0) to object pointed to by EAX; pop ST(0).
        fld   QWORD PTR [eax+8]   ;Repeat for next element.
        fadd  QWORD PTR [ebx+8]
        fstp  QWORD PTR [eax+8]
        add   eax, 16             ;Point to next element of array a.
        add   ebx, 16             ;Point to next element of array b.
        dec   ecx                 ;Decrement counter.
        jnz   add_loop            ;If elements remain, then jump.
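The same unroll-by-two structure, including the end case for an odd element count, can be sketched in C as follows. The function name `add_arrays_unrolled2` and the parameter `n` (standing in for MAX_LENGTH) are hypothetical and chosen only for illustration; `double` elements are assumed to match the QWORD PTR operands.

```c
#include <stddef.h>

/* Hypothetical helper (name and signature are assumptions):
   adds b[i] into a[i] using an unroll factor of two. */
void add_arrays_unrolled2(double *a, const double *b, size_t n)
{
    size_t i = 0;

    /* End case: if n is odd, handle one element before the loop.
       The assembly detects this with the carry flag after shr ecx, 1. */
    if (n & 1) {
        a[0] = a[0] + b[0];
        i = 1;
    }

    /* Main loop: the remaining count is even, so each iteration
       safely processes two adjacent elements. */
    for (; i < n; i += 2) {
        a[i]     = a[i]     + b[i];
        a[i + 1] = a[i + 1] + b[i + 1];
    }
}
```

One difference is worth noting. The assembly listing falls through into `add_loop` without rechecking the counter, so it appears to assume MAX_LENGTH is at least 2. The sketch tests `i < n` before each iteration and therefore also accepts lengths of 0 and 1.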

Chapter 7

Scheduling Optimizations

