Speeding Up Branches Based on Comparisons Between Floats

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

2.27Speeding Up Branches Based on Comparisons Between Floats

Optimization

Store operands of type float into a memory location and use integer comparison with the memory location to perform fast branches in cases where compilers do not support fast floating-point comparison instructions or 3DNow! code generation.

Application

This optimization applies to 32-bit software.

Rationale

Branches based on floating-point comparisons are often slow. The AMD Athlon 64 and

AMD Opteron processors support the FCOMI, FUCOMI, FCOMIP, and FUCOMIP instructions that allow implementation of fast branches based on comparisons between operands of type double or type float. However, many compilers do not support generating these instructions. Likewise, floating-point comparisons between operands of type float can be accomplished quickly by using the 3DNow! PFCMP instruction if the compiler supports 3DNow! code generation.

Many compilers only implement branches based on floating-point comparisons by using FCOM or FCOMP to compare the floating-point operands, followed by FSTSW AX in order to transfer the x87 condition-code flags into EAX. The subsequent branch is then based on the contents of the EAX register. Although the AMD Athlon 64 and AMD Opteron processors have acceleration hardware to speed up the FSTSW instruction, this process is still fairly slow.

Branches Dependent on Integer Comparisons Are Fast

One alternative for branches dependent upon the outcome of the comparison of operands of type float is to store the operand(s) into a memory location and then perform an integer comparison with that memory location. Branches dependent on integer comparisons are very fast. It should be noted that the replacement code uses a load dependent on an immediately prior store. If the store is not doubleword-aligned, no store-to-load-forwarding takes place, and the branch is still slow. Also, if there is a lot of activity in the load-store queue, forwarding of the store data may be somewhat delayed, thus negating some of the advantages of using the replacement code. It is recommended that you experiment with the replacement code to test whether it actually provides a performance increase in the code at hand.

The replacement code works well for comparisons against zero, including correct behavior when encountering a negative zero as allowed by the IEEE-754 standard. It also works well for comparing

C and C++ Source-Level Optimizations

Chapter 2

AMD 250 manual Speeding Up Branches Based on Comparisons Between Floats

Models: 250