AMD Athlon™ Processor x86 Code Optimization

22007E/0 — November 1999

7  Scheduling Optimizations . . . 67

Schedule Instructions According to their Latency . . . 67
Unrolling Loops . . . 67
    Complete Loop Unrolling . . . 67
    Partial Loop Unrolling . . . 68
Use Function Inlining . . . 71
    Overview . . . 71
    Always Inline Functions if Called from One Site . . . 72
    Always Inline Functions with Fewer than 25 Machine Instructions . . . 72
Avoid Address Generation Interlocks . . . 72
Use MOVZX and MOVSX . . . 73
Minimize Pointer Arithmetic in Loops . . . 73
Push Memory Data Carefully . . . 75

8  Integer Optimizations . . . 77

Replace Divides with Multiplies . . . 77
    Multiplication by Reciprocal (Division) Utility . . . 77
    Unsigned Division by Multiplication of Constant . . . 78
    Signed Division by Multiplication of Constant . . . 79
Use Alternative Code When Multiplying by a Constant . . . 81
Use MMX™ Instructions for Integer-Only Work . . . 83
Repeated String Instruction Usage . . . 84
    Latency of Repeated String Instructions . . . 84
    Guidelines for Repeated String Instructions . . . 84
Use XOR Instruction to Clear Integer Registers . . . 86
Efficient 64-Bit Integer Arithmetic . . . 86
Efficient Implementation of Population Count Function . . . 91
Derivation of Multiplier Used for Integer Division by Constants . . . 93
    Unsigned Derivation for Algorithm, Multiplier, and Shift Factor . . . 93

vi

Contents
