Chapter 5 Cache and Memory Optimizations 115
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
Rationale
Loads are served by the L1 data cache in program order, but the number of loads that the processor
can perform in one cycle depends on whether a bank conflict exists between the loads: two loads per
cycle when no bank conflict exists, and only one load per cycle when a bank conflict exists.
Therefore, pairing loads that do not have a bank conflict helps maximize load throughput.
Example
Avoid code like this, where two loads without a bank conflict are separated by other instructions:
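fld qword ptr [eax]       ; first load
fmul qword ptr [ebx]      ; multiply by a memory operand
faddp st(3), st           ; add the product to an accumulator and pop
fld qword ptr [eax+8]     ; second load, separated from the first
fmul qword ptr [ebx+8]    ; multiply by a memory operand
faddp st(2), st           ; add the product to an accumulator and pop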
Instead, rearrange the two loads so they appear as a pair:
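fld qword ptr [eax]       ; first load
fld qword ptr [eax+8]     ; second load, now adjacent to the first
fmul qword ptr [ebx+8]    ; multiply the value loaded last (top of stack)
faddp st(2), st           ; add the product to an accumulator and pop
fmul qword ptr [ebx]      ; multiply the value loaded first
faddp st(3), st           ; add the product to an accumulator and pop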
When a bank conflict    Then the number of loads the processor can perform per cycle is
Exists                  1
Does not exist          2
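The rule in the table can be sketched as a toy C model that counts how many cycles a sequence of loads needs when adjacent loads may share a cycle. This is an illustration only, not a description of the actual hardware. The names bank_conflict and load_cycles are hypothetical, and the bank layout used here (eight 8-byte banks selected by address bits 5:3, with a conflict when two loads hit the same bank in different 64-byte lines) is an assumption that this section does not state. The model also ignores every other pipeline constraint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION (not stated in this section): the L1 data cache is modeled as
   eight 8-byte banks selected by address bits 5:3, within 64-byte lines. */
enum { BANK_SHIFT = 3, BANK_MASK = 0x7, LINE_SHIFT = 6 };

/* Hypothetical helper: returns 1 if two loads map to the same bank but to
   different cache lines, which this model treats as a bank conflict. */
static int bank_conflict(uint32_t a, uint32_t b)
{
    int same_bank = ((a >> BANK_SHIFT) & BANK_MASK) ==
                    ((b >> BANK_SHIFT) & BANK_MASK);
    int same_line = (a >> LINE_SHIFT) == (b >> LINE_SHIFT);
    return same_bank && !same_line;
}

/* Hypothetical helper: counts cycles needed to serve n loads in program
   order. Two adjacent loads share a cycle only if they do not conflict
   (2 loads per cycle); otherwise a single load is served (1 per cycle). */
static size_t load_cycles(const uint32_t *addr, size_t n)
{
    size_t cycles = 0;
    size_t i = 0;

    assert(addr != NULL || n == 0);

    while (i < n) {
        if (i + 1 < n && !bank_conflict(addr[i], addr[i + 1])) {
            i += 2;   /* no conflict: both loads served this cycle */
        } else {
            i += 1;   /* conflict or last load: one load this cycle */
        }
        cycles++;
    }
    return cycles;
}
```

Under these assumptions, four loads at offsets 0 and 8 from two different bases pair up and take two cycles in the model, while four loads spaced 64 bytes apart conflict at every step and take four. This mirrors the motivation for placing non-conflicting loads next to each other.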