25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

10.4 Using the FXCH Instruction Rather Than FST/FLD Pairs

Optimization

Increase parallelism by breaking up dependency chains or by evaluating multiple dependency chains simultaneously by explicitly switching execution between them.

Application

This optimization applies to:

32-bit software

64-bit software

Rationale

Although the floating-point unit of the AMD Athlon 64 and AMD Opteron processors has a deep scheduler, which in most cases can extract sufficient parallelism from existing code, long dependency chains can stall the scheduler while issue slots are still available. The maximum dependency chain length that the scheduler can absorb is about six four-cycle instructions.

To switch execution between dependency chains, use the FXCH instruction, which has an apparent latency of zero cycles and generates only one micro-op. The floating-point unit of the AMD Athlon 64 and AMD Opteron processors contains special hardware to handle up to three FXCH instructions per cycle. FXCH is preferred over FST/FLD pairs, even if the FST/FLD pair works on a register, because an FST/FLD pair adds two cycles of latency and consists of two macro-ops.
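The idea of evaluating multiple dependency chains simultaneously can be sketched at the source level. The following C fragment is an illustration only and is not taken from this guide; the function names sum_single_chain and sum_two_chains are hypothetical. The first function forms one long chain in which every add waits for the previous one. The second keeps two independent accumulators, so consecutive adds belong to different chains. In x87 code, both accumulators would typically live on the register stack, and switching between them is the point at which FXCH, rather than an FST/FLD pair, would be used. Whether a compiler actually emits x87 instructions or FXCH depends on the compiler and its target options, so treat this as a sketch of the dependency structure rather than of the generated code.

```c
#include <stddef.h>

/* One dependency chain: each add depends on the result of the
   previous add, so the adds cannot overlap. */
double sum_single_chain(const double *x, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Two independent dependency chains: acc0 and acc1 never depend on
   each other inside the loop, so an add on one chain can issue while
   an add on the other is still in flight. */
double sum_two_chains(const double *x, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        acc0 += x[i];       /* chain 0 */
        acc1 += x[i + 1];   /* chain 1 */
    }
    if (i < n)              /* leftover element when n is odd */
        acc0 += x[i];

    return acc0 + acc1;     /* join the two chains once, at the end */
}
```

Splitting the sum changes the order in which the additions are performed, so for general floating-point data the two functions can differ in the last bits of the result. This is also why a compiler usually will not make the transformation on its own unless it is told that reassociation is acceptable.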

Chapter 10

x87 Floating-Point Optimizations
