25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Rationale

Loads are served by the L1 data cache in program order, but the number of loads that the processor can perform in one cycle depends on whether a bank conflict exists between the loads:

When a bank conflict    Then the number of loads the processor can perform per cycle is
Exists                  1
Does not exist          2

Therefore, pairing loads that do not have a bank conflict helps maximize load throughput.
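The rule in the table above can be sketched as a small C model. This is an illustrative sketch only, not a description taken from this section: the bank-conflict condition used here (eight 8-byte banks selected by address bits 5:3, an index in bits 14:6, and a conflict when two loads hit the same bank with different indexes) is an assumption, and the helper names bank_of, index_of, has_bank_conflict, and loads_per_cycle are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: the L1 data cache is modeled as 8 banks of 8 bytes each.
   Address bits 5:3 select the bank; bits 14:6 select the index.
   This layout is not stated in this section. */
static unsigned bank_of(uintptr_t addr)
{
    return (unsigned)((addr >> 3) & 0x7u);
}

static unsigned index_of(uintptr_t addr)
{
    return (unsigned)((addr >> 6) & 0x1FFu);
}

/* ASSUMPTION: two loads conflict when they map to the same bank
   but carry different index bits. */
static bool has_bank_conflict(uintptr_t a, uintptr_t b)
{
    return bank_of(a) == bank_of(b) && index_of(a) != index_of(b);
}

/* The table above: 1 load per cycle with a conflict, 2 without. */
static int loads_per_cycle(uintptr_t a, uintptr_t b)
{
    return has_bank_conflict(a, b) ? 1 : 2;
}
```

Under this model, two loads that are 8 bytes apart (such as [eax] and [eax+8]) fall into adjacent banks and can be performed in the same cycle, while two loads that are 64 bytes apart fall into the same bank with different indexes and are limited to one per cycle.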

Example

Avoid code like this, where two loads without a bank conflict are separated by other instructions:

fld   qword ptr [eax]
fmul  qword ptr [ebx]
faddp st(3), st
fld   qword ptr [eax+8]
fmul  qword ptr [ebx+8]
faddp st(2), st

 

Instead, rearrange the two loads so they appear as a pair:

fld   qword ptr [eax]
fld   qword ptr [eax+8]
fmul  qword ptr [ebx+8]
faddp st(2), st
fmul  qword ptr [ebx]
faddp st(3), st

 

Chapter 5

Cache and Memory Optimizations
