Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

7.3

Inline Functions

149

7.4

Address-Generation Interlocks

151

7.5

MOVZX and MOVSX

153

7.6

Pointer Arithmetic in Loops

154

7.7

Pushing Memory Data Directly onto the Stack

157

Chapter 8

Integer Optimizations

159

8.1

Replacing Division with Multiplication

160

8.2

Alternative Code for Multiplying by a Constant

164

8.3

Repeated String Instructions

167

8.4

Using XOR to Clear Integer Registers

169

8.5

Efficient 64-Bit Integer Arithmetic in 32-Bit Mode

170

8.6

Efficient Implementation of Population-Count Function in 32-Bit Mode

179

8.7

Efficient Binary-to-ASCII Decimal Conversion

181

8.8

Derivation of Algorithm, Multiplier, and Shift Factor for Integer Division by Constants

186

8.9

Optimizing Integer Division

192

Chapter 9

Optimizing with SIMD Instructions

193

9.1

Ensure All Packed Floating-Point Data are Aligned

195

9.2

Improving Scalar SSE and SSE2 Floating-Point Performance with MOVLPD and MOVLPS When Loading Data from Memory

196

9.3

Use MOVLPx/MOVHPx Instructions for Unaligned Data Access

198

9.4

Use MOVAPD and MOVAPS Instead of MOVUPD and MOVUPS

199

9.5

Structuring Code with Prefetch Instructions to Hide Memory Latency

200

9.6

Avoid Moving Data Directly Between General-Purpose and MMX™ Registers

206

9.7

Use MMX™ Instructions to Construct Fast Block-Copy Routines in 32-Bit Mode

207

9.8

Passing Data between MMX™ and 3DNow!™ Instructions

208

9.9

Storing Floating-Point Data in MMX™ Registers

209

9.10

EMMS and FEMMS Usage

210

vi

Contents
