22007E/0 — November 1999

AMD Athlon™ Processor x86 Code Optimization

Newton-Raphson Reciprocal Square Root

The general Newton-Raphson reciprocal square root recurrence is:

$$ Z_{i+1} = \tfrac{1}{2}\, Z_i \left(3 - b\, Z_i^{2}\right) $$

To reduce the number of iterations, the initial approximation is read from a table. The 3DNow! reciprocal square root approximation is accurate to at least 15 bits. Accordingly, obtaining a single-precision 24-bit reciprocal square root of an input operand b requires only one Newton-Raphson iteration, using the following sequence of 3DNow! instructions:

X0 = PFRSQRT(b)

X1 = PFMUL(X0,X0)

X2 = PFRSQIT1(b,X1)

X3 = PFRCPIT2(X2,X0)

X4 = PFMUL(b,X3)

The 24-bit final reciprocal square root value is X3. In the AMD Athlon processor 3DNow! implementation, the estimate contains the correct round-to-nearest value for approximately 87% of all arguments. The remaining arguments differ from the correct round-to-nearest value by one unit-in-the-last-place. The square root (X4) is formed in the last step by multiplying by the input operand b.

Use MMX™ PMADDWD Instruction to Perform Two 32-Bit Multiplies in Parallel

The MMX PMADDWD instruction can be used to perform two signed 16×16→32-bit multiplies in parallel, with much higher performance than can be achieved using the IMUL instruction. The PMADDWD instruction is designed to perform four signed 16×16→32-bit multiplies and accumulate the results pairwise. Forcing one product in each pair to zero leaves just two independent multiplies. The following example shows how to multiply 16-bit signed numbers a, b, c, and d into the signed 32-bit products a×c and b×d:
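Independently of any MMX listing, the pairwise multiply-accumulate and the zeroing trick can be modeled in portable C. This is a scalar sketch only: pmaddwd_model and mul2_16x16 are hypothetical helper names, and the choice of which word in each pair holds the zero is an assumption made for illustration.

```c
#include <stdint.h>

/* Scalar model of PMADDWD: four signed 16x16->32-bit products,
   added pairwise into two 32-bit results. Sums are formed in unsigned
   arithmetic so the single wraparound case stays well defined in C. */
static void pmaddwd_model(const int16_t x[4], const int16_t y[4],
                          int32_t out[2]) {
    uint32_t lo = (uint32_t)((int32_t)x[0] * y[0])
                + (uint32_t)((int32_t)x[1] * y[1]);
    uint32_t hi = (uint32_t)((int32_t)x[2] * y[2])
                + (uint32_t)((int32_t)x[3] * y[3]);
    out[0] = (int32_t)lo;
    out[1] = (int32_t)hi;
}

/* Two independent products: a*c in prod[0] and b*d in prod[1].
   The second word of each pair is zero, so each pairwise sum
   reduces to a single product. */
static void mul2_16x16(int16_t a, int16_t b, int16_t c, int16_t d,
                       int32_t prod[2]) {
    const int16_t x[4] = { a, 0, b, 0 };
    const int16_t y[4] = { c, 0, d, 0 };
    pmaddwd_model(x, y, prod);
}
```

Because one term in each pair is zero, every result fits in 32 bits without overflow, including the extreme case (-32768) × (-32768).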
