Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

9.17 Optimized 4×4 Matrix Multiplication on 4×1 Column Vector Routines

Optimization

Transpose the rotation matrix to eliminate the need to accumulate floating-point values in an XMM register.
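As a minimal sketch of the transposition itself, the following C function swaps the rows and columns of a 4×4 single-precision matrix. The function name `transpose4x4` and the flat row-major layout of 16 floats are assumptions made for illustration; the guide does not prescribe them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (name and layout are assumptions for illustration).
   src and dst each hold a 4x4 matrix as 16 floats in row-major order.
   After the call, row n of dst is column n of src. */
void transpose4x4(const float *src, float *dst)
{
    assert(src != dst);  /* this simple version does not work in place */

    for (size_t row = 0; row < 4; ++row) {
        for (size_t col = 0; col < 4; ++col) {
            dst[4 * col + row] = src[4 * row + col];
        }
    }
}
```

The transposition would typically be performed once, when the matrix is built, so that its cost is not paid again for every vector that is transformed.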

Application

This optimization applies to:

32-bit software

64-bit software

Rationale

Multiplying a 4×4 matrix by a 4×1 vector is common in 3-D graphics for geometric transformation (translating, scaling, rotating, and applying perspective to 3-D points represented in homogeneous coordinates). SIMD instructions increase the throughput of single-precision matrix multiplication, and other general optimizations can increase performance further. The first optimization is to transpose the rotation matrix so that column n of the matrix becomes row n and row m becomes column m. This optimization does not benefit 3DNow! technology code (3DNow! technology has extended instructions that preclude the need for it), but it does benefit SSE code. No SSE or SSE2 instruction accumulates the floats or doubles held in a single XMM register; for this reason, the matrix must be transposed. If the rotation matrix is not transposed, the dot product of a row of the matrix with the column vector requires accumulating four floating-point values within one XMM register. The multiplication upon the column vector is illustrated here:

 

 

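The rotation matrix $R$ and its transpose $\operatorname{tr}(R)$ are:

$$
\operatorname{tr}(R) =
\operatorname{tr}
\begin{bmatrix}
r_{00} & r_{01} & r_{02} & r_{03} \\
r_{10} & r_{11} & r_{12} & r_{13} \\
r_{20} & r_{21} & r_{22} & r_{23} \\
r_{30} & r_{31} & r_{32} & r_{33}
\end{bmatrix}
=
\begin{bmatrix}
r_{00} & r_{10} & r_{20} & r_{30} \\
r_{01} & r_{11} & r_{21} & r_{31} \\
r_{02} & r_{12} & r_{22} & r_{32} \\
r_{03} & r_{13} & r_{23} & r_{33}
\end{bmatrix}
$$

With $v = (v_0, v_1, v_2, v_3)$ and $v' = (v'_0, v'_1, v'_2, v'_3)$ as column vectors, the elements of $v'$ are computed in four steps. Each step multiplies one row of $\operatorname{tr}(R)$, which is one column of $R$, by a single element of $v$:

$$
\begin{array}{rclclclcl}
 & & \text{Step 0} & & \text{Step 1} & & \text{Step 2} & & \text{Step 3} \\
v'_0 & = & r_{00} \times v_0 & + & r_{01} \times v_1 & + & r_{02} \times v_2 & + & r_{03} \times v_3 \\
v'_1 & = & r_{10} \times v_0 & + & r_{11} \times v_1 & + & r_{12} \times v_2 & + & r_{13} \times v_3 \\
v'_2 & = & r_{20} \times v_0 & + & r_{21} \times v_1 & + & r_{22} \times v_2 & + & r_{23} \times v_3 \\
v'_3 & = & r_{30} \times v_0 & + & r_{31} \times v_1 & + & r_{32} \times v_2 & + & r_{33} \times v_3
\end{array}
$$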

 

 

In each step above, the elements of the rotation matrix can be loaded into an XMM register with the MOVAPS instruction, assuming the rotation matrix begins at a 16-byte-aligned memory location. Transposition of the rotation matrix eliminates the need to accumulate the floating-point values in an XMM register.
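The four steps above can be sketched in C with SSE intrinsics. This is an illustration under stated assumptions, not the guide's own routine: the function name, the argument order, and the use of `_mm_set1_ps` to broadcast each vector element are choices made here. The array `rt` is assumed to hold the transposed rotation matrix in row-major order at a 16-byte-aligned address, so each `_mm_load_ps` is an aligned load of the kind MOVAPS performs.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical routine (name and signature are assumptions).
   rt  : transposed rotation matrix, 16 floats, row-major, 16-byte aligned;
         row k of rt holds r0k, r1k, r2k, r3k.
   v   : input vector v0..v3 (no alignment required).
   out : result v'0..v'3, 16-byte aligned, where
         v'i = ri0*v0 + ri1*v1 + ri2*v2 + ri3*v3. */
void mat4_transposed_mul_vec4(const float *rt, const float *v, float *out)
{
    /* Aligned loads and stores fault on misaligned addresses. */
    assert(((uintptr_t)rt % 16) == 0);
    assert(((uintptr_t)out % 16) == 0);

    /* Step 0: row 0 of rt times v0 broadcast to all four lanes. */
    __m128 acc = _mm_mul_ps(_mm_load_ps(rt + 0), _mm_set1_ps(v[0]));

    /* Steps 1 to 3: add the remaining rows, each scaled by one element of v.
       The additions are vertical (lane by lane), so no sum across the
       lanes of a single register is ever needed. */
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(rt + 4),  _mm_set1_ps(v[1])));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(rt + 8),  _mm_set1_ps(v[2])));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(rt + 12), _mm_set1_ps(v[3])));

    _mm_store_ps(out, acc);
}
```

A caller needs 16-byte-aligned storage for `rt` and `out`, for example by declaring them with C11 `_Alignas(16)`. Lane i of the accumulator collects the four products for v'i, one per step, which is why the result is complete after the last addition.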


Optimizing with SIMD Instructions

Chapter 9
