
Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

register, the size of three rows (720 bytes) in EBX, and the size of five rows (1200 bytes) in ECX, the first element of rows 0–5 of A can be addressed as follows:

fld QWORD PTR [edi-128]	; Load A[i,0].
fld QWORD PTR [edi+eax-128]	; Load A[i+1,0].
fld QWORD PTR [edi+eax*2-128]	; Load A[i+2,0].
fld QWORD PTR [edi+ebx-128]	; Load A[i+3,0].
fld QWORD PTR [edi+eax*4-128]	; Load A[i+4,0].
fld QWORD PTR [edi+ecx-128]	; Load A[i+5,0].

This addressing scheme reduces the size of all loads from memory to 3 bytes; additionally, to address rows 6–11 of A, you only need to add 240*6 to EDI.
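The offset arithmetic behind this scheme can be checked with a small C sketch. It is an illustration only, not part of the original listing: it assumes a 30 x 30 row-major matrix of 8-byte doubles (so that one row is 240 bytes), and the helper names `row_offset`, `disp8_for_column`, and `load_biased` are hypothetical. The variables `eax`, `ebx`, and `ecx` merely mirror the register contents described above.

```c
#include <assert.h>
#include <stddef.h>

#define N 30                                          /* 30 x 30 matrix, as in the text */
#define ROW_BYTES ((ptrdiff_t)(N * sizeof(double)))   /* 240 bytes when double is 8 bytes */
#define BIAS 128                                      /* EDI points 128 bytes past the row start */

/* Hypothetical helper: the register term added to EDI to reach row i+k. */
static ptrdiff_t row_offset(int k)
{
    const ptrdiff_t eax = ROW_BYTES;       /* size of one row    */
    const ptrdiff_t ebx = 3 * ROW_BYTES;   /* size of three rows */
    const ptrdiff_t ecx = 5 * ROW_BYTES;   /* size of five rows  */

    assert(k >= 0 && k <= 5);
    switch (k) {
    case 0:  return 0;          /* [edi-128]       */
    case 1:  return eax;        /* [edi+eax-128]   */
    case 2:  return eax * 2;    /* [edi+eax*2-128] */
    case 3:  return ebx;        /* [edi+ebx-128]   */
    case 4:  return eax * 4;    /* [edi+eax*4-128] */
    default: return ecx;        /* [edi+ecx-128]   */
    }
}

/* Displacement for column col once EDI is biased by 128 bytes.
   For col = 0..29 the result runs from -128 to +104, so every
   element of a row is reachable with a signed 8-bit displacement. */
static int disp8_for_column(int col)
{
    assert(col >= 0 && col < N);
    return col * (int)sizeof(double) - BIAS;
}

/* Load A[i+k][col] the way the listing does: biased base pointer,
   optional register term, small displacement. */
static double load_biased(const double *a, int i, int k, int col)
{
    const char *edi = (const char *)(a + (size_t)i * N) + BIAS;
    const char *p = edi + row_offset(k) + disp8_for_column(col);
    return *(const double *)p;
}
```

Moving on to rows 6–11 corresponds to calling `load_biased` with `i` increased by 6, which is the same as adding 240*6 to the biased base pointer.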

Avoid Register Dependencies by Spacing Apart Instructions that Accumulate Results in a Register

The second general optimization to consider is spacing out register dependencies. Operations inside the floating-point unit have an execution latency (normally 3–4 cycles for x87 operations). Consider this instruction sequence:

fldz	; Push 0.0 onto floating-point stack.

fld QWORD PTR [edi-128]	; Push A[i,0] onto stack.
fmul QWORD PTR [esi-128]	; Multiply A[i,0] by B[0,j].
faddp st(1), st(0)	; Accumulate contribution to dot product of
	; A’s row i and B’s column j.

fld QWORD PTR [edi-120]	; Push A[i,1] onto stack.
fmul QWORD PTR [esi-120]	; Multiply A[i,1] by B[1,j].
faddp st(1), st(0)	; Accumulate contribution to dot product of
	; A’s row i and B’s column j.

fld QWORD PTR [edi-112]	; Push A[i,2] onto stack.
fmul QWORD PTR [esi-112]	; Multiply A[i,2] by B[2,j].
faddp st(1), st(0)	; Accumulate contribution to dot product of
	; A’s row i and B’s column j.

The second statement loads A[i,0] into ST(0), and the third statement multiplies it by B[0,j]. The next statement adds this product to ST(1), where the dot product of row i of matrix A and column j of matrix B is accumulated. Each subsequent group of three instructions adds the contribution of one of the remaining 29 elements to the dot product.

This code is poor because all the addition operations depend on the contents of a single register, ST(1). The AMD Athlon, AMD Athlon 64, and AMD Opteron processors have out-of-order-execution floating-point units, but none of the addition operations can be performed out of order because the result of each addition depends on the outcome of the previous one. Scheduling instructions this way greatly limits the throughput of the floating-point unit. To alleviate this, space out operations that depend on one another. In this case, work with six rows of A at a time rather than one, as follows:

; Multiply first element of each of six rows of A by first element of
; B’s column j.

fldz	; Push 0.0 six times onto floating-point stack.
fldz
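The difference between the two schedules can also be sketched in C. This is an illustration under stated assumptions, not the guide's listing: the function names are hypothetical, A is taken to be row-major with 30 doubles per row, and the 30 elements of B's column j are assumed to be stored contiguously (as the consecutive [esi-128], [esi-120], ... operands suggest). In the first function every addition waits for the previous one; in the second, six running sums are independent, so consecutive additions do not depend on each other.

```c
#include <assert.h>
#include <stddef.h>

#define N 30   /* elements per row of A and per column of B */

/* One accumulator: a single chain of dependent additions. */
static double dot_row_serial(const double *a_row, const double *b_col)
{
    double sum = 0.0;

    assert(a_row != NULL && b_col != NULL);
    for (size_t k = 0; k < N; k++)
        sum += a_row[k] * b_col[k];     /* each add needs the previous sum */
    return sum;
}

/* Six rows of A at once: six independent accumulators, so an add into
   one sum never waits on an add into another. `a` points at the first
   of six consecutive rows. */
static void dot_six_rows(const double *a, const double *b_col, double out[6])
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0, s4 = 0.0, s5 = 0.0;

    assert(a != NULL && b_col != NULL && out != NULL);
    for (size_t k = 0; k < N; k++) {
        const double b = b_col[k];      /* one element of B, reused six times */
        s0 += a[0 * N + k] * b;
        s1 += a[1 * N + k] * b;
        s2 += a[2 * N + k] * b;
        s3 += a[3 * N + k] * b;
        s4 += a[4 * N + k] * b;
        s5 += a[5 * N + k] * b;
    }
    out[0] = s0; out[1] = s1; out[2] = s2;
    out[3] = s3; out[4] = s4; out[5] = s5;
}
```

Each of the six sums is still accumulated in the same element order as in the serial version, so the restructuring changes only how the additions interleave, not which products go into each dot product. Whether a given compiler keeps this schedule in the generated code is not guaranteed; the sketch shows the data dependencies only.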

x87 Floating-Point Optimizations

Chapter 10
