25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

fldz
fldz
fldz
fldz

fld   QWORD PTR [esi-128]         ; Push B[0,j] onto stack.

fld   QWORD PTR [edi-128]         ; Push A[i,0] onto stack.
fmul  st(0), st(1)                ; Multiply A[i,0] by B[0,j].
faddp st(7), st(0)                ; Accumulate contribution to dot product of
                                  ; A's row i and B's column j.

fld   QWORD PTR [edi+eax-128]     ; Push A[i+1,0] onto stack.
fmul  st(0), st(1)                ; Multiply A[i+1,0] by B[0,j].
faddp st(6), st(0)                ; Accumulate contribution to dot product of
                                  ; A's row i+1 and B's column j.

fld   QWORD PTR [edi+eax*2-128]   ; Push A[i+2,0] onto stack.
fmul  st(0), st(1)                ; Multiply A[i+2,0] by B[0,j].
faddp st(5), st(0)                ; Accumulate contribution to dot product of
                                  ; A's row i+2 and B's column j.

fld   QWORD PTR [edi+ebx-128]     ; Push A[i+3,0] onto stack.
fmul  st(0), st(1)                ; Multiply A[i+3,0] by B[0,j].
faddp st(4), st(0)                ; Accumulate contribution to dot product of
                                  ; A's row i+3 and B's column j.

fld   QWORD PTR [edi+eax*4-128]   ; Push A[i+4,0] onto stack.
fmul  st(0), st(1)                ; Multiply A[i+4,0] by B[0,j].
faddp st(3), st(0)                ; Accumulate contribution to dot product of
                                  ; A's row i+4 and B's column j.

fmul  QWORD PTR [edi+ecx-128]     ; Multiply A[i+5,0] by B[0,j].
faddp st(1), st(0)                ; Accumulate contribution to dot product of
                                  ; A's row i+5 and B's column j.

The processor can execute the instructions in this code sequence out of order because the six multiply-accumulate chains are independent of one another. Even though the loads and multiplies appear sequentially in the program, the floating-point scheduler can execute the FLD and FMUL instructions out of order, in addition to the FADDP instructions, so as to keep the multiplier and adder pipes of the floating-point unit busy.

B[0,j] is loaded into an x87 register once and multiplied by the loaded element of each row with the reg, reg form of FMUL, which minimizes the number of load operations. Additionally, the first element from the sixth row of A is not loaded with FLD but is multiplied directly from memory by the loaded element B[0,j]. This eliminates an FLD instruction, which reduces the number of instructions in the instruction cache and the workload on the processor’s decoder.

To achieve two floating-point operations per clock cycle, the number of floating-point operations should be twice the number of load-store operations. In the example above, there are 12 floating-point operations (six multiplies and six adds) and seven operations requiring loads from memory, a ratio of about 1.7 to 1, so nearly two floating-point operations can be performed per clock cycle.
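The operation count is easier to see in a higher-level form. The following C sketch is not taken from the guide; it restates the same arithmetic under some assumptions: `dot6_column`, `lda`, `ldb`, and `acc` are hypothetical names, A and B are assumed to be stored row-major, and the six local variables stand in for the six x87 stack accumulators. A compiler is free to generate different instructions, so this illustrates the data reuse only, not the exact x87 code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the guide): adds to acc[0..5] the dot
 * products of six consecutive rows of A with one column of B.
 *   a   points at A[i,0];  lda = elements per row of A (assumed row-major)
 *   b   points at B[0,j];  ldb = elements per row of B (assumed row-major)
 *   n   = number of terms in each dot product
 */
void dot6_column(const double *a, size_t lda,
                 const double *b, size_t ldb,
                 size_t n, double acc[6])
{
    assert(a != NULL && b != NULL && acc != NULL);

    /* Six independent accumulators, mirroring the six x87 registers. */
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0, s4 = 0.0, s5 = 0.0;

    for (size_t k = 0; k < n; ++k) {
        const double bk = b[k * ldb];   /* one load of B[k,j], reused six times */

        /* Each line: one load of A, one multiply, one add. */
        s0 += a[0 * lda + k] * bk;
        s1 += a[1 * lda + k] * bk;
        s2 += a[2 * lda + k] * bk;
        s3 += a[3 * lda + k] * bk;
        s4 += a[4 * lda + k] * bk;
        s5 += a[5 * lda + k] * bk;
        /* Per iteration: 12 floating-point operations and 7 loads. */
    }

    acc[0] += s0; acc[1] += s1; acc[2] += s2;
    acc[3] += s3; acc[4] += s4; acc[5] += s5;
}
```

Each pass through the loop body corresponds to one block of the assembly listing above: the single read of `b[k * ldb]` plays the role of the FLD of B[0,j], and the six multiply-add lines correspond to the six FMUL/FADDP pairs. Processing fewer rows per pass would lower the ratio of floating-point operations to loads, because the one load of B would be shared by fewer multiplies.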

Chapter 10

x87 Floating-Point Optimizations
