Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

2.25 Accelerating Floating-Point Division and Square Root

Optimization

In applications that make heavy use of single-precision division and square-root operations, it is recommended that you port the code to SSE or 3DNow!™ inline assembly or use a compiler that can generate SSE or 3DNow! technology code. If neither of these methods is possible, the precision-control (PC) bits of the x87 FPU control word can be set to single precision to improve performance. (The processor defaults to double-extended precision. See AMD64 Architecture Programmer's Manual Volume 1: Application Programming (order# 24592) for details on the FPU control word.)

Application

This optimization applies to 32-bit software.

Rationale

Division and square root have a much longer latency than other floating-point operations, even though the AMD Athlon 64 and AMD Opteron processors provide significant acceleration of these two operations. In some application programs, these operations occur often enough to seriously impact performance. If code has hot spots that use single-precision arithmetic only (that is, all computation involves data of type float) and for some reason cannot be ported to SSE or 3DNow! code, the following technique may be used to improve performance.

The x87 FPU has a precision-control field as part of the FPU control word. The precision-control setting determines the rounding precision of instruction results and affects the basic arithmetic operations, including division and square root. On the AMD Athlon 64 and AMD Opteron processors, division and square root are computed only to the number of bits necessary for the currently selected precision. Setting precision control to single precision (versus the Win32 default of double precision) therefore lowers the latency of those operations.
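The layout of the precision-control field can be sketched in C as follows. This is a minimal illustration that only manipulates a 16-bit control-word value in memory; it does not load the value into the FPU. The field occupies bits 8 and 9 of the control word (00b = single, 10b = double, 11b = double-extended, 01b = reserved). The macro and function names are hypothetical and are not part of any library.

```c
#include <stdint.h>

/* Precision-control (PC) field: bits 8-9 of the x87 FPU control word. */
#define X87_PC_MASK     0x0300u
#define X87_PC_SINGLE   0x0000u  /* 24-bit significand */
#define X87_PC_DOUBLE   0x0200u  /* 53-bit significand */
#define X87_PC_EXTENDED 0x0300u  /* 64-bit significand */

/* Return a copy of control word cw with its PC field replaced by pc.
   All other bits (exception masks, rounding control) are left unchanged. */
uint16_t x87_cw_with_pc(uint16_t cw, uint16_t pc)
{
    return (uint16_t)((cw & ~X87_PC_MASK) | (pc & X87_PC_MASK));
}

/* Return the significand width in bits selected by cw,
   or 0 for the reserved encoding 01b. */
int x87_cw_significand_bits(uint16_t cw)
{
    switch (cw & X87_PC_MASK) {
    case X87_PC_SINGLE:   return 24;
    case X87_PC_DOUBLE:   return 53;
    case X87_PC_EXTENDED: return 64;
    default:              return 0;
    }
}
```

For example, the control word 0x037F that the processor loads on FPU initialization selects double-extended precision (64 bits), and replacing the PC field of 0x027F (double precision) with X87_PC_SINGLE yields 0x007F.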

The Microsoft® Visual C environment provides functions to manipulate the FPU control word and thus the precision control. Note that these functions are not very fast, so insert changes of precision control where they create little overhead, such as outside a computation-intensive loop. Otherwise, the overhead created by the function calls outweighs the benefit from reducing the latencies of divide and square-root operations. For more information on this topic, see AMD64 Architecture Programmer's Manual Volume 1: Application Programming (order# 24592).

The following example shows how to set the precision control to single precision and later restore the original settings in the Microsoft Visual C environment.
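A minimal sketch of this pattern, assuming the Microsoft C run-time function `_controlfp` and its `_PC_24` and `_MCW_PC` constants from `<float.h>`. The routine name `divide_arrays` and its parameters are hypothetical. The control-word calls are compiled only for 32-bit x86 builds with Microsoft Visual C; on other targets the routine performs the divisions without touching precision control.

```c
#include <float.h>  /* _controlfp, _PC_24, _MCW_PC in the Microsoft C run-time */

/* Hypothetical single-precision hot spot: out[i] = num[i] / den[i].
   Precision control is changed once before the loop and restored once
   after it, so the cost of the control-word calls is not paid per element. */
void divide_arrays(float *out, const float *num, const float *den, int n)
{
    int i;
#if defined(_MSC_VER) && defined(_M_IX86)
    /* A mask of 0 changes nothing and returns the current control word. */
    unsigned int orig_cw = _controlfp(0, 0);
    /* Select single precision (24-bit significand). */
    _controlfp(_PC_24, _MCW_PC);
#endif

    for (i = 0; i < n; i++)
        out[i] = num[i] / den[i];

#if defined(_MSC_VER) && defined(_M_IX86)
    /* Restore the caller's precision-control setting. */
    _controlfp(orig_cw, _MCW_PC);
#endif
}
```

Restoring the original setting before returning matters because other code in the application, including library routines, may rely on double or double-extended precision.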


C and C++ Source-Level Optimizations

Chapter 2
