54 C and C++ Source-Level Optimizations Chapter 2
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
2.27 Speeding Up Branches Based on Comparisons Between Floats
Optimization
Store operands of type float into a memory location and use integer comparison with the memory
location to perform fast branches in cases where compilers do not support fast floating-point
comparison instructions or 3DNow! code generation.
Application
This optimization applies to 32-bit software.
Rationale
Branches based on floating-point comparisons are often slow. The AMD Athlon 64 and
AMD Opteron processors support the FCOMI, FUCOMI, FCOMIP, and FUCOMIP instructions that
allow implementation of fast branches based on comparisons between operands of type double or
type float. However, many compilers do not support generating these instructions. Likewise,
floating-point comparisons between operands of type float can be accomplished quickly by using
the 3DNow! PFCMP instruction if the compiler supports 3DNow! code generation.
Many compilers implement branches based on floating-point comparisons only by using FCOM or
FCOMP to compare the floating-point operands, followed by FSTSW AX in order to transfer the x87
condition-code flags into EAX. The subsequent branch is then based on the contents of the EAX
register. Although the AMD Athlon 64 and AMD Opteron processors have acceleration hardware to
speed up the FSTSW instruction, this process is still fairly slow.
Branches Dependent on Integer Comparisons Are Fast
One alternative for branches dependent upon the outcome of the comparison of operands of type
float is to store the operand(s) into a memory location and then perform an integer comparison with
that memory location. Branches dependent on integer comparisons are very fast. Note, however,
that the replacement code uses a load that depends on an immediately prior store. If the store is not
doubleword-aligned, no store-to-load forwarding takes place, and the branch is still slow. Also, if
there is a lot of activity in the load-store queue, forwarding of the store data may be somewhat
delayed, negating some of the advantage of the replacement code. Experiment with the
replacement code to test whether it actually provides a performance increase in the code at hand.
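As a minimal sketch of the idea, the float can be copied into an integer of the same size and the branch taken on the integer value. The helper names below (float_bits, float_bits_signed, is_negative, is_zero, is_positive) are hypothetical and chosen for illustration only. The sketch assumes that float is a 32-bit IEEE-754 single-precision value and that the operand is never a NaN; NaN bit patterns would give different results from the corresponding floating-point comparison.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: raw bit pattern of a float as an unsigned integer.
   Assumes float is a 32-bit IEEE-754 single-precision value. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* store the float, reload it as an integer */
    return u;
}

/* Hypothetical helper: the same bit pattern viewed as a signed integer. */
static int32_t float_bits_signed(float f)
{
    int32_t i;
    memcpy(&i, &f, sizeof i);
    return i;
}

/* Integer stand-in for (f < 0.0f). The sign bit must be set and at least
   one other bit must be set, so negative zero (0x80000000) is excluded.
   Assumption: f is not a NaN. */
static int is_negative(float f)
{
    return float_bits(f) > 0x80000000u;
}

/* Integer stand-in for (f == 0.0f). Masking off the sign bit makes
   +0.0 and -0.0 compare equal to zero. */
static int is_zero(float f)
{
    return (float_bits(f) & 0x7FFFFFFFu) == 0;
}

/* Integer stand-in for (f > 0.0f). Viewed as a signed integer, every
   positive float is greater than zero, while +0.0 is zero and all values
   with the sign bit set are negative. Assumption: f is not a NaN. */
static int is_positive(float f)
{
    return float_bits_signed(f) > 0;
}
```

A branch such as if (f < 0.0f) would then be written as if (is_negative(f)). The local integer is naturally doubleword-aligned, which matters for store-to-load forwarding; whether the compiler actually emits a store followed by an integer load, and whether that is faster, should be checked in the code at hand.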
The replacement code works well for comparisons against zero, including correct behavior when
encountering a negative zero as allowed by the IEEE-754 standard. It also works well for comparing