AMD x86 manual Pipelined Pair of 24-Bit Precision Divides, Newton-Raphson Reciprocal


22007E/0 — November 1999

AMD Athlon™ Processor x86 Code Optimization

Pipelined Pair of 24-Bit Precision Divides

This divide operation executes with a total latency of 21 cycles, assuming that the program hides the latency of the first MOVD/MOVQ instructions within preceding code.

Example:

    MOVQ       MM0, [DIVISORS]   ; y   | x
    PFRCP      MM1, MM0          ; 1/x | 1/x  (approximate)
    MOVQ       MM2, MM0          ; y   | x
    PUNPCKHDQ  MM0, MM0          ; y   | y
    PFRCP      MM0, MM0          ; 1/y | 1/y  (approximate)
    PUNPCKLDQ  MM1, MM0          ; 1/y | 1/x  (approximate)
    MOVQ       MM0, [DIVIDENDS]  ; z   | w
    PFRCPIT1   MM2, MM1          ; 1/y | 1/x  (intermediate)
    PFRCPIT2   MM2, MM1          ; 1/y | 1/x  (final)
    PFMUL      MM0, MM2          ; z/y | w/x
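The data movement in the example can be traced with a small Python model. This is only a sketch under stated assumptions, not a description of the hardware: each MMX register is represented as a hypothetical (high, low) pair of floats, the PFRCP estimate is modelled as the exact reciprocal with an assumed relative error of 2^-15, and the PFRCPIT1/PFRCPIT2 pair is collapsed into a single helper, `refine`, that applies one Newton-Raphson step per element. All function names are illustrative.

```python
# Each MMX register is modelled as a (high, low) pair of Python floats.

SEED_ERROR = 2.0 ** -15  # assumed relative error of the PFRCP estimate


def pfrcp(src):
    """Approximate reciprocal of the LOW element, copied to both halves."""
    r = (1.0 / src[1]) * (1.0 - SEED_ERROR)
    return (r, r)


def punpckhdq(dst, src):
    """Interleave the high halves: result is (src.high, dst.high)."""
    return (src[0], dst[0])


def punpckldq(dst, src):
    """Interleave the low halves: result is (src.low, dst.low)."""
    return (src[1], dst[1])


def refine(b, x0):
    """Stand-in for PFRCPIT1 then PFRCPIT2: one Newton-Raphson step per element."""
    return tuple(x * (2.0 - d * x) for d, x in zip(b, x0))


def pfmul(dst, src):
    """Element-wise multiply of two packed pairs."""
    return (dst[0] * src[0], dst[1] * src[1])


def divide_pair(dividends, divisors):
    mm0 = divisors              # MOVQ      MM0, [DIVISORS]   ; y   | x
    mm1 = pfrcp(mm0)            # PFRCP     MM1, MM0          ; 1/x | 1/x
    mm2 = mm0                   # MOVQ      MM2, MM0          ; y   | x
    mm0 = punpckhdq(mm0, mm0)   # PUNPCKHDQ MM0, MM0          ; y   | y
    mm0 = pfrcp(mm0)            # PFRCP     MM0, MM0          ; 1/y | 1/y
    mm1 = punpckldq(mm1, mm0)   # PUNPCKLDQ MM1, MM0          ; 1/y | 1/x
    mm0 = dividends             # MOVQ      MM0, [DIVIDENDS]  ; z   | w
    mm2 = refine(mm2, mm1)      # PFRCPIT1 + PFRCPIT2         ; 1/y | 1/x (final)
    return pfmul(mm0, mm2)      # PFMUL     MM0, MM2          ; z/y | w/x
```

With dividends (z, w) = (6.0, 1.0) and divisors (y, x) = (3.0, 4.0), `divide_pair((6.0, 1.0), (3.0, 4.0))` returns a pair very close to (2.0, 0.25). The two unpack steps exist because PFRCP reads only the low element of its source, so the high divisor y must first be moved into the low position before its reciprocal can be estimated.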

 

Newton-Raphson Reciprocal

Consider the quotient q = a/b. An (on-chip) ROM-based table lookup can be used to quickly produce a 14-to-15-bit precision approximation of 1/b using just one PFRCP instruction. A full 24-bit precision reciprocal can then be quickly computed from this approximation using a Newton-Raphson algorithm.

The general Newton-Raphson recurrence for the reciprocal is as follows:

$Z_{i+1} = Z_i \, (2 - b \, Z_i)$
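The convergence rate of this recurrence can be checked by tracking the relative error of each estimate. The symbol $\varepsilon_i$ is introduced here purely for illustration and is not part of the original text; the derivation assumes exact arithmetic and ignores rounding in the intermediate steps.

```latex
\begin{aligned}
Z_i &= \frac{1 - \varepsilon_i}{b} \\
Z_{i+1} &= Z_i \, (2 - b \, Z_i)
         = \frac{1 - \varepsilon_i}{b} \, (1 + \varepsilon_i)
         = \frac{1 - \varepsilon_i^{2}}{b} \\
\varepsilon_{i+1} &= \varepsilon_i^{2}
\end{aligned}
```

Each iteration squares the relative error, so the number of correct bits roughly doubles: an estimate with $|\varepsilon_0| \le 2^{-14}$ yields $|\varepsilon_1| \le 2^{-28}$, which is already below the $2^{-24}$ resolution of a single-precision mantissa.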

Given that the initial approximation is accurate to at least 14 bits, and that a full IEEE single-precision mantissa contains 24 bits, just one Newton-Raphson iteration is required. The following sequence shows the 3DNow! instructions that produce the initial reciprocal approximation, compute the full precision reciprocal from the approximation, and finally, complete the desired divide of a/b.

X0 = PFRCP(b)

X1 = PFRCPIT1(b,X0)

X2 = PFRCPIT2(X1,X0)

q = PFMUL(a,X2)
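The four steps above can be mimicked in scalar Python to see how the precision grows. This is a hedged sketch, not the processor's actual arithmetic: the `*_model` functions are hypothetical stand-ins, the seed is simulated by truncating the exact reciprocal to 14 significant bits, and the split of work between the two iteration steps (PFRCPIT1 forming the 2 - b·X0 term, PFRCPIT2 multiplying it by X0) is an assumption made for illustration.

```python
import math
import struct


def f32(v):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", v))[0]


def pfrcp_model(b, bits=14):
    """Hypothetical PFRCP stand-in: 1/b truncated to `bits` significant bits."""
    exact = 1.0 / b
    scale = 2.0 ** (bits - math.frexp(exact)[1])
    return math.floor(exact * scale) / scale


def pfrcpit1_model(b, x0):
    """Assumed first step: form the correction term (2 - b*X0)."""
    return 2.0 - b * x0


def pfrcpit2_model(x1, x0):
    """Assumed second step: X0 * X1, rounded to single precision."""
    return f32(x0 * x1)


def divide(a, b):
    x0 = pfrcp_model(b)          # X0 = PFRCP(b)
    x1 = pfrcpit1_model(b, x0)   # X1 = PFRCPIT1(b, X0)
    x2 = pfrcpit2_model(x1, x0)  # X2 = PFRCPIT2(X1, X0)
    return f32(a * x2)           # q  = PFMUL(a, X2)
```

In this model the seed for b = 3.0 is off by a relative error of at most 2^-13, while X2 agrees with 1/3 to within single-precision rounding, matching the claim that one iteration suffices.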

The 24-bit final reciprocal value is X2. In the AMD Athlon processor implementation of 3DNow! technology, X2 contains the correctly rounded (round-to-nearest) single-precision reciprocal for approximately 99% of all arguments.

Use 3DNow!™ Instructions for Fast Division

