AMD x86 manual

Contents

AMD Athlon Processor x86 Code Optimization, 22007E/0, November 1999

List of Figures
List of Tables
Revision History

1 Introduction
    About this Document
    AMD Athlon Processor Family
    AMD Athlon Processor Microarchitecture Summary

2 Top Optimizations
    Optimization Star
    Group I Optimizations: Essential Optimizations
        Memory Size and Alignment Issues
        Use the 3DNow! PREFETCH and PREFETCHW Instructions
        Select DirectPath Over VectorPath Instructions
    Group II Optimizations: Secondary Optimizations
        Load-Execute Instruction Usage
        Take Advantage of Write Combining
        Use 3DNow! Instructions
        Avoid Branches Dependent on Random Data
        Avoid Placing Code and Data in the Same 64-Byte Cache Line

3 C Source Level Optimizations
    Ensure Floating-Point Variables and Expressions are of Type Float
    Use 32-Bit Data Types for Integer Code
    Consider the Sign of Integer Operands
    Use Array Style Instead of Pointer Style Code
        Example 2 (Preferred)
    Completely Unroll Small Loops
    Avoid Unnecessary Store-to-Load Dependencies
    Consider Expression Order in Compound Branch Conditions
    Switch Statement Usage
        Optimize Switch Statements
    Use Prototypes for All Functions
    Use Const Type Qualifier
    Generic Loop Hoisting
        Generalization for Multiple Constant Control Code
        The above loop should be transformed into:
    Declare Local Functions as Static
    Dynamic Memory Allocation Consideration
    Introduce Explicit Parallelism into Code
    Explicitly Extract Common Subexpressions
    C Language Structure Component Considerations
    Sort Local Variables According to Base Type Size
    Accelerating Floating-Point Divides and Square Roots
    Avoid Unnecessary Integer Division
    Copy Frequently De-referenced Pointer Arguments to Local Variables
        Example 1 (Avoid)
        Example 2 (Preferred)

4 Instruction Decoding Optimizations
    Overview
    Select DirectPath Over VectorPath Instructions
    Load-Execute Instruction Usage
        Use Load-Execute Integer Instructions
        Use Load-Execute Floating-Point Instructions with Floating-Point Operands
        Avoid Load-Execute Floating-Point Instructions with Integer Operands
    Align Branch Targets in Program Hot Spots
    Use Short Instruction Lengths
    Avoid Partial Register Reads and Writes
    Replace Certain SHLD Instructions with Alternative Code
    Use 8-Bit Sign-Extended Immediates
    Use 8-Bit Sign-Extended Displacements
    Code Padding Using Neutral Code Fillers
        Recommendations for the AMD Athlon Processor
        Recommendations for AMD-K6 Family and AMD Athlon Processor Blended Code

5 Cache and Memory Optimizations
    Memory Size and Alignment Issues
        Avoid Memory Size Mismatches
        Align Data Where Possible
    Use the 3DNow! PREFETCH and PREFETCHW Instructions
    Take Advantage of Write Combining
    Avoid Placing Code and Data in the Same 64-Byte Cache Line
    Store-to-Load Forwarding Restrictions
        Store-to-Load Forwarding Pitfalls: True Dependencies
        Summary of Store-to-Load Forwarding Pitfalls to Avoid
    Stack Alignment Considerations
    Align TBYTE Variables on Quadword Aligned Addresses
    C Language Structure Component Considerations
    Sort Variables According to Base Type Size

6 Branch Optimizations
    Avoid Branches Dependent on Random Data
        AMD Athlon Processor Specific Code
        Blended AMD-K6 and AMD Athlon Processor Code
    Always Pair CALL and RETURN
    Replace Branches with Computation in 3DNow! Code
        Muxing Constructs
        Sample Code Translated into 3DNow! Code
            Example 2: C code, 3DNow! code
            Example 3: C code, 3DNow! code
            Example 4: C code, 3DNow! code
            Example 5: C code, 3DNow! code
    Avoid the Loop Instruction
    Avoid Far Control Transfer Instructions
    Avoid Recursive Functions

7 Scheduling Optimizations
    Schedule Instructions According to their Latency
    Unrolling Loops
        Complete Loop Unrolling
        Partial Loop Unrolling
    Use Function Inlining
        Overview
        Always Inline Functions if Called from One Site
        Always Inline Functions with Fewer than 25 Machine Instructions
    Avoid Address Generation Interlocks
    Use MOVZX and MOVSX
    Minimize Pointer Arithmetic in Loops
    Push Memory Data Carefully

8 Integer Optimizations
    Replace Divides with Multiplies
        Multiplication by Reciprocal (Division) Utility
        Unsigned Division by Multiplication of Constant
        Signed Division by Multiplication of Constant
    Use Alternative Code When Multiplying by a Constant
    Use MMX Instructions for Integer-Only Work
    Repeated String Instruction Usage
        Latency of Repeated String Instructions
        Guidelines for Repeated String Instructions
    Use XOR Instruction to Clear Integer Registers
    Efficient 64-Bit Integer Arithmetic
        Example 4 (Left shift)
        Example 5 (Right shift)
        Example 6 (Multiplication)
        Example 7 (Division)
        Example 8 (Remainder)
    Efficient Implementation of Population Count Function
    Derivation of Multiplier Used for Integer Division by Constants
        Unsigned Derivation for Algorithm, Multiplier, and Shift Factor
        Signed Derivation for Algorithm, Multiplier, and Shift Factor
        The utility sdiv.exe was compiled using the following code.

9 Floating-Point Optimizations
    Ensure All FPU Data is Aligned
    Use Multiplies Rather than Divides
    Use FFREEP Macro to Pop One Register from the FPU Stack
    Floating-Point Compare Instructions
    Use the FXCH Instruction Rather than FST/FLD Pairs
    Avoid Using Extended-Precision Data
    Minimize Floating-Point-to-Integer Conversions
    Floating-Point Subexpression Elimination
    Check Argument Range of Trigonometric Instructions Efficiently
    Take Advantage of the FSINCOS Instruction

10 3DNow! and MMX Optimizations
    Use 3DNow! Instructions
    Use FEMMS Instruction
    Use 3DNow! Instructions for Fast Division
        Optimized 14-Bit Precision Divide
        Optimized Full 24-Bit Precision Divide
        Pipelined Pair of 24-Bit Precision Divides
        Newton-Raphson Reciprocal
    Use 3DNow! Instructions for Fast Square Root and Reciprocal Square Root
        Optimized 15-Bit Precision Square Root
        Optimized 24-Bit Precision Square Root
        Newton-Raphson Reciprocal Square Root
    Use MMX PMADDWD Instruction to Perform Two 32-Bit Multiplies in Parallel
    3DNow! and MMX Intra-Operand Swapping
    Fast Conversion of Signed Words to Floating-Point
    Use MMX PXOR to Negate 3DNow! Data
    Use MMX PCMP Instead of 3DNow! PFCMP
    Use MMX Instructions for Block Copies and Block Fills
    Use MMX PXOR to Clear All Bits in an MMX Register
    Use MMX PCMPEQD to Set All Bits in an MMX Register
    Use MMX PAND to Find Absolute Value in 3DNow! Code
    Optimized Matrix Multiplication
    Efficient 3D-Clipping Code Computation Using 3DNow! Instructions
    Use 3DNow! PAVGUSB for MPEG-2 Motion Compensation
        Example 1 (Avoid)
    Stream of Packed Unsigned Bytes
    Complex Number Arithmetic

11 General x86 Optimization Guidelines
    Short Forms
    Dependencies
    Register Operands
    Stack Allocation

Appendix A AMD Athlon Processor Microarchitecture
    AMD Athlon Processor Microarchitecture
    Superscalar Processor
    Instruction Cache
    Predecode
    Branch Prediction
    Early Decoding
    Instruction Control Unit
    Data Cache
    Integer Scheduler
    Integer Execution Unit
        [Figure labels: MacroOPs; Instruction Control Unit and Register Files; Integer Scheduler (18-entry)]
    Floating-Point Scheduler
    Floating-Point Execution Unit
    Load-Store Unit (LSU)
        [Figure labels: Data Cache 2-way, 64 Kbytes; LSU; Operand Buses]
    L2 Cache Controller
    Write Combining
    AMD Athlon System Bus

Appendix B Pipeline and Execution Unit Resources Overview
    Fetch and Decode Pipeline Stages
        [Figure: stages 1 to 6: FETCH, SCAN, ALIGN1/MECTL, ALIGN2/MEROM, EDEC/MEDEC, IDEC]
    Integer Pipeline Stages
        [Figure labels: Instruction Control Unit and Register Files; MacroOPs; Integer Scheduler (18-entry)]
    Floating-Point Pipeline Stages
    Execution Unit Resources
        Terminology
        Integer Pipeline Operations
        Floating-Point Pipeline Operations
        Load/Store Pipeline Operations
        Code Sample Analysis
        Table 7. Sample 1 Integer Register Operations
        Table 8. Sample 2 Integer Register and Memory Load Operations

Appendix C Implementation of Write Combining
    Write-Combining Definitions and Abbreviations
    What is Write Combining?
    Programming Details
    Write-Combining Operations
        Table 9. Write Combining Completion Events
        Sending Write-Buffer Data to the System

Appendix D Performance-Monitoring Counters
    Overview
    Performance Counter Usage
        PerfEvtSel[3:0] MSRs (MSR Addresses C001_0000h-C001_0003h)
        Table 11. Performance-Monitoring Counters
        PerfCtr[3:0] MSRs (MSR Addresses C001_0004h-C001_0007h)
        Starting and Stopping the Performance-Monitoring Counters
        Event and Time-Stamp Monitoring Software
        Monitoring Counter Overflow

Appendix E Programming the MTRR and PAT
    Memory Type Range Register (MTRR) Mechanism
    Page Attribute Table (PAT)
        Table 15. Effective Memory Type Based on PAT and MTRRs
        Table 16. Final Output Memory Types
        MTRR Fixed-Range Register Format
        The memory types defined for memory segments defined in

Appendix F Instruction Dispatch and Execution Resources
    Table 20. MMX Instructions
    Table 21. MMX Extensions
    Table 22. Floating-Point Instructions
    Table 23. 3DNow! Instructions
    Table 24. 3DNow! Extensions

Appendix G DirectPath versus VectorPath Instructions
    Select DirectPath Over VectorPath Instructions
    DirectPath Instructions
        Table 25. DirectPath Integer Instructions
        Table 26. DirectPath MMX Instructions
        Table 27. DirectPath MMX Extensions
        Table 28. DirectPath Floating-Point Instructions
    VectorPath Instructions
        Table 29. VectorPath Integer Instructions
        Table 30. VectorPath MMX Instructions
        Table 31. VectorPath MMX Extensions
        Table 32. VectorPath Floating-Point Instructions

Index
    Numerics, A, B, C, L, M, N, O, P