vi Contents
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
7.3 Inline Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .149
7.4 Address-Generation Interlocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.5 MOVZX and MOVSX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .153
7.6 Pointer Arithmetic in Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .154
7.7 Pushing Memory Data Directly onto the Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Chapter 8 Integer Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .159
8.1 Replacing Division with Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .160
8.2 Alternative Code for Multiplying by a Constant . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.3 Repeated String Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .167
8.4 Using XOR to Clear Integer Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .169
8.5 Efficient 64-Bit Integer Arithmetic in 32-Bit Mode . . . . . . . . . . . . . . . . . . . . . . . . .170
8.6 Efficient Implementation of Population-Count Function in 32-Bit Mode . . . . . . . .179
8.7 Efficient Binary-to-ASCII Decimal Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . .181
8.8 Derivation of Algorithm, Multiplier, and Shift Factor for Integer Division by Constants . . . . . . . . . . . . . . . . 186
8.9 Optimizing Integer Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .192
Chapter 9 Optimizing with SIMD Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
9.1 Ensure All Packed Floating-Point Data are Aligned. . . . . . . . . . . . . . . . . . . . . . . . . 195
9.2 Improving Scalar SSE and SSE2 Floating-Point Performance with MOVLPD and MOVLPS When Loading Data from Memory . . . . . . . . . . . . . . . . 196
9.3 Use MOVLPx/MOVHPx Instructions for Unaligned Data Access . . . . . . . . . . . . .198
9.4 Use MOVAPD and MOVAPS Instead of MOVUPD and MOVUPS . . . . . . . . . . .199
9.5 Structuring Code with Prefetch Instructions to Hide Memory Latency . . . . . . . . . .200
9.6 Avoid Moving Data Directly Between General-Purpose and MMX™ Registers . . . . . . . . . . . . . . . . 206
9.7 Use MMX™ Instructions to Construct Fast Block-Copy Routines in 32-Bit Mode . . . . . . . . . . . . . . . . 207
9.8 Passing Data between MMX™ and 3DNow!™ Instructions . . . . . . . . . . . . . . . . . .208
9.9 Storing Floating-Point Data in MMX™ Registers . . . . . . . . . . . . . . . . . . . . . . . . . .209
9.10 EMMS and FEMMS Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .210