25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

10.2 Achieving Two Floating-Point Operations per Clock Cycle

Optimization

Pay special attention to the order and packing of the operations to sustain up to two floating-point operations per clock cycle.

Application

This optimization applies to:

32-bit software

64-bit software

Rationale

The floating-point unit in the AMD Athlon, AMD Athlon 64, and AMD Opteron processors can sustain up to two floating-point operations per clock cycle. However, to achieve this, you must pay special attention to the order and packing of the operations. For example, consider multiplying a 30 × 30 double-precision matrix A by a transposed 30 × 30 double-precision matrix B, the result of which is called C.
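As a plain reference for what the example computes, the following Python sketch forms C from A and a transposed B. It is not the optimized x87 code this section develops. The names matmul_transposed, a, bt, and c are hypothetical, and the input values are made up for illustration.

```python
N = 30  # matrix dimension from the text


def matmul_transposed(a, bt):
    """Return C where C[i][j] = sum over k of a[i][k] * bt[j][k].

    `bt` holds B transposed, so row j of `bt` is column j of B and every
    dot product walks two contiguous rows.
    """
    n = len(a)
    return [[sum(a[i][k] * bt[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Illustrative inputs only: A has simple values, B is the identity matrix
# (whose transpose is also the identity).
a = [[float(i + k) for k in range(N)] for i in range(N)]
bt = [[1.0 if j == k else 0.0 for k in range(N)] for j in range(N)]
c = matmul_transposed(a, bt)

# Operation counts for the full product: each of the N*N dot products of
# length N needs N multiplications and N - 1 additions.
mults = N * N * N
adds = N * N * (N - 1)
print(mults, adds)  # 27000 26100
```

Because B is held transposed, each element of C is a dot product of one row of A with one column of B, and both operands are contiguous in memory.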

Use Efficient Addressing of FPU Data When Loading and Storing

The rows of A are 240 bytes wide, as are the columns of B. There are eight x87 floating-point registers [ST(0)–ST(7)], and in this example, six rows of A are concurrently multiplied by a single column of B. The address of the first element of the first row of A (A[0]) is presumed to be stored in the EDI register, while the address of the first element of the first column of B (B[0]) is stored in ESI.
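The layout just described can be checked with a little arithmetic. The Python sketch below computes the byte offset of each element relative to the address held in the base register and counts how many elements of the first row lie within the range of a signed 8-bit offset (-128 to +127). The helper names byte_offset and fits_signed_byte and the base_elem parameter are hypothetical, introduced only for this illustration.

```python
ELEM_BYTES = 8               # size of one double-precision value
N = 30                       # elements per row of A (and per column of B)
ROW_BYTES = N * ELEM_BYTES   # 240, matching the row width given in the text


def byte_offset(row, k, base_elem=0):
    """Byte offset of element k of the given row, measured from the address
    of element `base_elem` of row 0 (the address held in the base register)."""
    return row * ROW_BYTES + (k - base_elem) * ELEM_BYTES


def fits_signed_byte(offset):
    """True if the offset is representable as a signed 8-bit value."""
    return -128 <= offset <= 127


# Offsets of the first row of A when the base register points at A[0].
row0 = [byte_offset(0, k) for k in range(N)]
near = [k for k, off in enumerate(row0) if fits_signed_byte(off)]

print(ROW_BYTES)          # 240
print(row0[0], row0[-1])  # 0 232
print(len(near))          # 16
```

With the base register pointing at A[0], only the first 16 of the 30 elements in a row fall within the signed 8-bit range. Passing a different base_elem shows how the reachable set shifts when the base address is moved.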

This addressing scheme might seem like a good idea, but it is not. Only 128 bytes can be addressed forward of A[0] with 8-bit offsets, for which each instruction is only 3 bytes long (2 bytes for the instruction and 1 byte for the offset from the address stored in the general-purpose register). Once the offset reaches 128 bytes or more from the address in the general-purpose register, the size of the instruction increases from 3 bytes to 6 bytes (such offsets are represented by 32 bits rather than 8 bits). Large instructions reduce the number of decoded operations available for execution within the pipes of the floating-point unit, and as such prevent the code from achieving two floating-point operations per clock cycle.

To alleviate this, the general-purpose registers EDI and ESI are offset by 128 bytes such that they contain the addresses of A[15] and B[15]. This is beneficial because data within 128 bytes (16 double-precision numbers) before or after these two locations can now be accessed with instructions that are 2–3 bytes in size. The next five rows of A can be efficiently addressed in terms of the first row. Storing the size of a single row of A (240 bytes) in the EAX

Chapter 10

x87 Floating-Point Optimizations
