General 64-Bit Optimizations

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

2.20 Replacing Integer Division with Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . .43 2.21 Frequently Dereferenced Pointer Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .44 2.22 Array Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .46 2.23 32-Bit Integral Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .47 2.24 Sign of Integer Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .48 2.25 Accelerating Floating-Point Division and Square Root . . . . . . . . . . . . . . . . . . . . . . .50 2.26 Fast Floating-Point-to-Integer Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .52 2.27 Speeding Up Branches Based on Comparisons Between Floats . . . . . . . . . . . . . . . .54 2.28 Improving Performance in Linux Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .57

Chapter 3 General 64-Bit Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59

3.1 64-Bit Registers and Integer Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .60 3.2 64-Bit Arithmetic and Large-Integer Multiplication . . . . . . . . . . . . . . . . . . . . . . . . .62 3.3 128-Bit Media Instructions and Floating-Point Operations . . . . . . . . . . . . . . . . . . . .67 3.4 32-Bit Legacy GPRs and Small Unsigned Integers . . . . . . . . . . . . . . . . . . . . . . . . . .68

Chapter 4 Instruction-Decoding Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .71

4.1 DirectPath Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .72 4.2 Load-Execute Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .73

4.2.1 Load-Execute Integer Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .73

4.2.2Load-ExecuteFloating-Point Instructions with Floating-Point Operands . . .74

4.2.3 Load-ExecuteFloating-Point Instructions with Integer Operands . . . . . . . . .74 4.3 Branch Targets in Program Hot Spots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .76 4.4 32/64-Bit vs. 16-Bit Forms of the LEA Instruction . . . . . . . . . . . . . . . . . . . . . . . . . .77 4.5 Take Advantage of x86 and AMD64 Complex Addressing Modes . . . . . . . . . . . . . .78 4.6 Short Instruction Encodings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .80 4.7 Partial-Register Reads and Writes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .81 4.8 Using LEAVE for Function Epilogues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .83 4.9 Alternatives to SHLD Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .85 4.10 8-BitSign-Extended Immediate Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .87 4.11 8-BitSign-Extended Displacements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .88 4.12 Code Padding with Operand-Size Override and NOP . . . . . . . . . . . . . . . . . . . . . . . .89

Contents

AMD 250 manual General 64-Bit Optimizations

Models: 250

Chapter 3 General 64-Bit Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59

Chapter 4 Instruction-Decoding Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .71