25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

9.11 Using SIMD Instructions for Fast Square Roots and Fast Reciprocal Square Roots . . . . . . . . 211

9.12 Use XOR Operations to Negate Operands of SSE, SSE2, and 3DNow!™ Instructions . . . . . . . . 215

9.13 Clearing MMX™ and XMM Registers with XOR Instructions . . . . . . . . 216

9.14 Finding the Floating-Point Absolute Value of Operands of SSE, SSE2, and 3DNow!™ Instructions . . . . . . . . 217

9.15 Accumulating Single-Precision Floating-Point Numbers Using SSE, SSE2, and 3DNow!™ Instructions . . . . . . . . 218

9.16 Complex-Number Arithmetic Using SSE, SSE2, and 3DNow!™ Instructions . . . . . . . . 221

9.17 Optimized 4 × 4 Matrix Multiplication on 4 × 1 Column Vector Routines . . . . . . . . 230

Chapter 10 x87 Floating-Point Optimizations . . . . . . . . 237

10.1 Using Multiplication Rather Than Division . . . . . . . . 238

10.2 Achieving Two Floating-Point Operations per Clock Cycle . . . . . . . . 239

10.3 Floating-Point Compare Instructions . . . . . . . . 244

10.4 Using the FXCH Instruction Rather Than FST/FLD Pairs . . . . . . . . 245

10.5 Floating-Point Subexpression Elimination . . . . . . . . 246

10.6 Accumulating Precision-Sensitive Quantities in x87 Registers . . . . . . . . 247

10.7 Avoiding Extended-Precision Data . . . . . . . . 248

Appendix A Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors . . . . . . . . 249

A.1 Key Microarchitecture Features . . . . . . . . 250

A.2 Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors . . . . . . . . 251

A.3 Superscalar Processor . . . . . . . . 251

A.4 Processor Block Diagram . . . . . . . . 251

A.5 L1 Instruction Cache . . . . . . . . 252

A.6 Branch-Prediction Table . . . . . . . . 253

A.7 Fetch-Decode Unit . . . . . . . . 254

A.8 Instruction Control Unit . . . . . . . . 254

A.9 Translation-Lookaside Buffer . . . . . . . . 254

A.10 L1 Data Cache . . . . . . . . 255

A.11 Integer Scheduler . . . . . . . . 256

Contents

vii
