Chapter X87 Floating-Point Optimizations 237

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

9.11	Using SIMD Instructions for Fast Square Roots and Fast Reciprocal Square Roots	. .211
9.12	Use XOR Operations to Negate Operands of SSE, SSE2, and 3DNow!™ Instructions	. .215
9.13	Clearing MMX™ and XMM Registers with XOR Instructions	. .216
9.14	Finding the Floating-Point Absolute Value of Operands of SSE, SSE2, and 3DNow!™ Instructions	. .217
9.15	Accumulating Single-Precision Floating-Point Numbers Using SSE, SSE2, and 3DNow!™ Instructions	. .218

9.16	Complex-Number Arithmetic Using SSE, SSE2, and 3DNow!™ Instructions	. .221
9.17	Optimized 4 × 4 Matrix Multiplication on 4 × 1 Column Vector Routines	. .230
Chapter 10	x87 Floating-Point Optimizations	. .237
10.1	Using Multiplication Rather Than Division	. .238
10.2	Achieving Two Floating-Point Operations per Clock Cycle	. .239
10.3	Floating-Point Compare Instructions	. .244
10.4	Using the FXCH Instruction Rather Than FST/FLD Pairs	. .245
10.5	Floating-Point Subexpression Elimination	. .246
10.6	Accumulating Precision-Sensitive Quantities in x87 Registers	. .247
10.7	Avoiding Extended-Precision Data	. .248
Appendix A	Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors	. .249
A.1	Key Microarchitecture Features	. .250
A.2	Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors	. .251
A.3	Superscalar Processor	. .251
A.4	Processor Block Diagram	. .251
A.5	L1 Instruction Cache	. .252
A.6	Branch-Prediction Table	. .253
A.7	Fetch-Decode Unit	. .254
A.8	Instruction Control Unit	. .254
A.9	Translation-Lookaside Buffer	. .254
A.10	L1 Data Cache	. .255
A.11	Integer Scheduler	. .256

Contents

