Chapter 10 x87 Floating-Point Optimizations
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
10.4 Using the FXCH Instruction Rather Than FST/FLD Pairs
Optimization
Increase parallelism by breaking up dependency chains or by evaluating multiple dependency chains
simultaneously by explicitly switching execution between them.
Application
This optimization applies to:
• 32-bit software
• 64-bit software
Rationale
Although the floating-point unit of the AMD Athlon 64 and AMD Opteron processors has a deep
scheduler, which in most cases can extract sufficient parallelism from existing code, long dependency
chains can stall the scheduler while issue slots are still available. The maximum dependency chain
length that the scheduler can absorb is about six four-cycle instructions (roughly 24 cycles of
dependent latency).
To switch execution between dependency chains, use the FXCH instruction, because it has an
apparent latency of zero cycles and generates only one micro-op. The floating-point
unit of the AMD Athlon 64 and AMD Opteron processors contains special hardware to handle up to
three FXCH instructions per cycle. Using FXCH is preferred over FST/FLD pairs, even if
the FST/FLD pair works on a register. An FST/FLD pair adds two cycles of latency and consists of
two micro-ops.
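The following is a minimal sketch of the technique, not code taken from this guide. It assumes GCC or Clang on an x86 or x86-64 target and uses AT&T-syntax inline assembly. The function names two_products_fxch and two_products_fld_fstp are hypothetical. Each function evaluates two independent three-element products, so there are two dependency chains to interleave. The second function shows one plausible register-only form of the store/load pattern (FLD ST(1) followed by FSTP ST(2)); that reading of "FST/FLD pair" is an assumption made for contrast.

```c
#if !defined(__i386__) && !defined(__x86_64__)
#error "This sketch uses x87 instructions and needs an x86 or x86-64 target."
#endif

/* Preferred form: switch between chain A and chain B with FXCH.
   a and b each point to three doubles; *pa and *pb receive the products. */
void two_products_fxch(const double *a, const double *b,
                       double *pa, double *pb)
{
    __asm__ volatile (
        "fldl   (%[a])      \n\t"  /* ST0 = a[0]                       */
        "fldl   (%[b])      \n\t"  /* ST0 = b[0],       ST1 = a[0]     */
        "fmull  8(%[b])     \n\t"  /* chain B: ST0 = b[0]*b[1]         */
        "fxch   %%st(1)     \n\t"  /* switch: chain A is now in ST0    */
        "fmull  8(%[a])     \n\t"  /* chain A: ST0 = a[0]*a[1]         */
        "fxch   %%st(1)     \n\t"  /* switch back to chain B           */
        "fmull  16(%[b])    \n\t"  /* chain B complete                 */
        "fxch   %%st(1)     \n\t"  /* switch to chain A                */
        "fmull  16(%[a])    \n\t"  /* chain A complete                 */
        "fstpl  (%[pa])     \n\t"  /* store A and pop; ST0 = B         */
        "fstpl  (%[pb])     \n\t"  /* store B and pop; stack balanced  */
        :
        : [a] "r"(a), [b] "r"(b), [pa] "r"(pa), [pb] "r"(pb)
        : "memory", "st", "st(1)"
    );
}

/* Contrast form (assumed pattern): reach chain A by copying it to the
   top of the stack and storing it back, two instructions per switch. */
void two_products_fld_fstp(const double *a, const double *b,
                           double *pa, double *pb)
{
    __asm__ volatile (
        "fldl   (%[a])      \n\t"  /* ST0 = a[0]                       */
        "fldl   (%[b])      \n\t"  /* ST0 = b[0],       ST1 = a[0]     */
        "fmull  8(%[b])     \n\t"  /* chain B: ST0 = b[0]*b[1]         */
        "fld    %%st(1)     \n\t"  /* push a copy of chain A           */
        "fmull  8(%[a])     \n\t"  /* chain A: ST0 = a[0]*a[1]         */
        "fstp   %%st(2)     \n\t"  /* write A back and pop; ST0 = B    */
        "fmull  16(%[b])    \n\t"  /* chain B complete                 */
        "fld    %%st(1)     \n\t"  /* push a copy of chain A           */
        "fmull  16(%[a])    \n\t"  /* chain A complete                 */
        "fstp   %%st(2)     \n\t"  /* write A back and pop; ST0 = B    */
        "fstpl  (%[pb])     \n\t"  /* store B and pop; ST0 = A         */
        "fstpl  (%[pa])     \n\t"  /* store A and pop; stack balanced  */
        :
        : [a] "r"(a), [b] "r"(b), [pa] "r"(pa), [pb] "r"(pb)
        : "memory", "st", "st(1)", "st(2)"
    );
}
```

Both functions return the same results. The first needs one FXCH per switch and only two stack registers, while the second spends two instructions per switch, places the extra copy on chain A's critical path, and needs a third stack register. Any actual speed difference should be measured on the target processor rather than inferred from this sketch.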