Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

9.17 Optimized 4 × 4 Matrix Multiplication on 4 × 1 Column Vector Routines

Optimization

Transpose the rotation matrix to eliminate the need to accumulate floating-point values in an XMM register.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

The multiplication of a 4 × 4 matrix with a 4 × 1 vector is commonly used in 3-D graphics for geometric transformation (translating, scaling, rotating, and applying perspective to 3-D points represented in homogeneous coordinates). SIMD instructions increase the throughput of single-precision matrix multiplication, and other general optimizations can increase performance further. The first optimization is to transpose the rotation matrix, so that column n of the matrix becomes row n and row m becomes column m. This optimization does not benefit 3DNow! technology code (3DNow! technology has extended instructions that preclude the need for this optimization), but it does benefit SSE code. No SSE or SSE2 instruction accumulates the floats or doubles held in a single XMM register; for this reason, the matrix must be transposed. If the rotation matrix is not transposed, the dot product of a row of the matrix with the column vector requires accumulating the four floating-point values within one XMM register. The multiplication of the transposed matrix by the column vector is illustrated here:


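$$
\operatorname{tr}(R) \times v
= \operatorname{tr}\begin{pmatrix}
r_{00} & r_{01} & r_{02} & r_{03} \\
r_{10} & r_{11} & r_{12} & r_{13} \\
r_{20} & r_{21} & r_{22} & r_{23} \\
r_{30} & r_{31} & r_{32} & r_{33}
\end{pmatrix} \times v
= \begin{pmatrix}
r_{00} & r_{10} & r_{20} & r_{30} \\
r_{01} & r_{11} & r_{21} & r_{31} \\
r_{02} & r_{12} & r_{22} & r_{32} \\
r_{03} & r_{13} & r_{23} & r_{33}
\end{pmatrix} \times
\begin{pmatrix} v_0 \\ v_1 \\ v_2 \\ v_3 \end{pmatrix}
= \begin{pmatrix} v'_0 \\ v'_1 \\ v'_2 \\ v'_3 \end{pmatrix}
$$

$$
\begin{array}{rcccccccc}
 & & \text{Step 0} & & \text{Step 1} & & \text{Step 2} & & \text{Step 3} \\
v'_0 & = & r_{00} \times v_0 & + & r_{01} \times v_1 & + & r_{02} \times v_2 & + & r_{03} \times v_3 \\
v'_1 & = & r_{10} \times v_0 & + & r_{11} \times v_1 & + & r_{12} \times v_2 & + & r_{13} \times v_3 \\
v'_2 & = & r_{20} \times v_0 & + & r_{21} \times v_1 & + & r_{22} \times v_2 & + & r_{23} \times v_3 \\
v'_3 & = & r_{30} \times v_0 & + & r_{31} \times v_1 & + & r_{32} \times v_2 & + & r_{33} \times v_3
\end{array}
$$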
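
The four steps shown above can be sketched in C with SSE intrinsics. This is a minimal illustration, not the guide's own routine: the function name mat4_mul_vec4_transposed and the array names rt, v, and out are hypothetical, the transposed matrix is assumed to be stored row by row in rt (so rt[4*j + i] holds r(i,j)), rt and out are assumed to be 16-byte aligned, and the compiler is assumed to target x86 with SSE enabled.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load_ps, _mm_set1_ps, ... */

/*
 * Hypothetical helper (name and signature are assumptions, not from the guide).
 * rt  : transposed 4x4 matrix, row-major, so rt[4*j + i] == r(i,j);
 *       row j of rt is (r0j r1j r2j r3j). Must be 16-byte aligned.
 * v   : input column vector (v0 v1 v2 v3); no alignment required.
 * out : result (v'0 v'1 v'2 v'3). Must be 16-byte aligned.
 */
void mat4_mul_vec4_transposed(const float *rt, const float *v, float *out)
{
    /* Aligned loads and stores fault on misaligned addresses, so check. */
    assert(((uintptr_t)rt % 16) == 0);
    assert(((uintptr_t)out % 16) == 0);

    /* Step 0: (r00 r10 r20 r30) * v0, with v0 broadcast to all four lanes. */
    __m128 acc = _mm_mul_ps(_mm_load_ps(rt + 0), _mm_set1_ps(v[0]));

    /* Step 1: add (r01 r11 r21 r31) * v1. */
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(rt + 4), _mm_set1_ps(v[1])));

    /* Step 2: add (r02 r12 r22 r32) * v2. */
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(rt + 8), _mm_set1_ps(v[2])));

    /* Step 3: add (r03 r13 r23 r33) * v3. */
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(rt + 12), _mm_set1_ps(v[3])));

    /* Lane i of acc now holds v'i; no sum across lanes was needed. */
    _mm_store_ps(out, acc);
}
```

Every addition in this sketch is a vertical add between two registers, so the four partial products of each v'i are never summed across the lanes of one register. With the untransposed layout, each row-times-vector product would leave its four terms in the lanes of a single register, and those terms would then have to be combined with extra shuffle and add instructions.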

In each step above, the elements of the rotation matrix can be loaded into an XMM register with the MOVAPS instruction, assuming the rotation matrix begins at a 16-byte-aligned memory location. Transposition of the rotation matrix eliminates the need to accumulate the floating-point values in an

Optimizing with SIMD Instructions

Chapter 9
