22007E/0 — November 1999

AMD Athlon™ Processor x86 Code Optimization

Without Loop Unrolling:

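        MOV   ECX, MAX_LENGTH     ; ECX = number of elements to process
        MOV   EAX, OFFSET A       ; EAX points to array A
        MOV   EBX, OFFSET B       ; EBX points to array B
$add_loop:
        FLD   QWORD PTR [EAX]     ; load A[i]
        FADD  QWORD PTR [EBX]     ; add B[i]
        FSTP  QWORD PTR [EAX]     ; store the sum back to A[i] and pop
        ADD   EAX, 8              ; advance to the next double in A
        ADD   EBX, 8              ; advance to the next double in B
        DEC   ECX
        JNZ   $add_loop           ; repeat until all elements are done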

The loop consists of seven instructions. Because the AMD Athlon processor can decode and retire at most three instructions per cycle, the loop cannot execute faster than three iterations in seven cycles, or 3/7 floating-point adds per cycle. The pipelined floating-point adder, however, allows one add every cycle, so instruction bandwidth rather than the adder limits this loop. In the following code, the loop is partially unrolled by a factor of two, which creates potential end cases that must be handled outside the loop:

With Partial Loop Unrolling:

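        MOV   ECX, MAX_LENGTH     ; ECX = number of elements (assumed >= 2)
        MOV   EAX, OFFSET A       ; EAX points to array A
        MOV   EBX, OFFSET B       ; EBX points to array B
        SHR   ECX, 1              ; ECX = number of pairs; CF = 1 if the count is odd
        JNC   $add_loop           ; even count: no end case to handle
        FLD   QWORD PTR [EAX]     ; odd count: add the first element on its own
        FADD  QWORD PTR [EBX]
        FSTP  QWORD PTR [EAX]
        ADD   EAX, 8
        ADD   EBX, 8
$add_loop:
        FLD   QWORD PTR [EAX]     ; first element of the pair
        FADD  QWORD PTR [EBX]
        FSTP  QWORD PTR [EAX]
        FLD   QWORD PTR [EAX+8]   ; second element of the pair
        FADD  QWORD PTR [EBX+8]
        FSTP  QWORD PTR [EAX+8]
        ADD   EAX, 16             ; advance both pointers by two doubles
        ADD   EBX, 16
        DEC   ECX
        JNZ   $add_loop           ; repeat until all pairs are done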
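For readers who prefer to follow the control flow in C, the partially unrolled listing can be sketched roughly as below. This is an illustration only, not code from the manual: the function name add_arrays_unrolled2 and its parameters are assumptions made for the example. The sketch also guards the main loop against a pair count of zero, which the DEC/JNZ form above does not do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (name and signature are assumptions, not from the
 * manual): computes a[i] += b[i] for i in [0, n), unrolled by two. */
static void add_arrays_unrolled2(double *a, const double *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));

    size_t pairs = n >> 1;      /* corresponds to SHR ECX, 1 */

    if (n & 1) {                /* odd length: the bit shifted out was 1 */
        a[0] += b[0];           /* end case, handled before the loop */
        a += 1;
        b += 1;
    }

    /* Main loop: two independent adds per iteration, so the pointer
     * updates and the loop branch are paid once per two elements. */
    for (; pairs != 0; pairs--) {
        a[0] += b[0];
        a[1] += b[1];
        a += 2;
        b += 2;
    }
}
```

Calling add_arrays_unrolled2(a, b, 5) handles element 0 as the end case and then runs the main loop twice. Whether a compiler turns this into code resembling the listing above depends on the compiler and its options.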

The unrolled loop now consists of 10 instructions. Based on the decode/retire bandwidth of three OPs per cycle, this loop goes
