AMD Athlon™ Processor x86 Code Optimization

22007E/0 — November 1999

Use 3DNow!™ Instructions for Fast Square Root and Reciprocal Square Root

3DNow! instructions can be used to compute a very fast, highly accurate square root and reciprocal square root.

Optimized 15-Bit Precision Square Root

This square root operation executes in only 7 cycles, assuming the program hides the latency of the first MOVD instruction within previous code. The reciprocal square root operation requires four fewer cycles than the square root operation, since the final PFMUL is not needed.

Example:

    MOVD       MM0, [MEM]   ;     0     |     a
    PFRSQRT    MM1, MM0     ; 1/sqrt(a) | 1/sqrt(a)   (approximate)
    PUNPCKLDQ  MM0, MM0     ;     a     |     a       (MMX instr.)
    PFMUL      MM0, MM1     ;  sqrt(a)  |  sqrt(a)

 

Optimized 24-Bit Precision Square Root

This square root operation executes in only 19 cycles, assuming the program hides the latency of the first MOVD instruction within previous code. The reciprocal square root operation requires four fewer cycles than the square root operation, since the final PFMUL is not needed.

Example:

    MOVD       MM0, [MEM]   ;     0     |     a
    PFRSQRT    MM1, MM0     ; 1/sqrt(a) | 1/sqrt(a)   (approx.)
    MOVQ       MM2, MM1     ; X_0 = 1/sqrt(a)         (approx.)
    PFMUL      MM1, MM1     ; X_0 * X_0 | X_0 * X_0   (step 1)
    PUNPCKLDQ  MM0, MM0     ;     a     |     a       (MMX instr.)
    PFRSQIT1   MM1, MM0     ; (intermediate)          (step 2)
    PFRCPIT2   MM1, MM2     ; 1/sqrt(a) | 1/sqrt(a)   (step 3)
    PFMUL      MM0, MM1     ;  sqrt(a)  |  sqrt(a)

 
