AMD x86 manual
Contents
1 Introduction
2 Top Optimizations
3 C Source Level Optimizations
4 Instruction Decoding Optimizations
5 Cache and Memory Optimizations
6 Branch Optimizations
7 Scheduling Optimizations
8 Integer Optimizations
9 Floating-Point Optimizations
10 3DNow! and MMX Optimizations
11 General x86 Optimization Guidelines
Appendix A AMD Athlon Processor Microarchitecture
Appendix B Pipeline and Execution Unit Resources Overview
Appendix C Implementation of Write Combining
Appendix D Performance-Monitoring Counters
List of Figures
List of Tables
Revision History
Introduction
About this Document
AMD Athlon Processor Family
AMD Athlon Processor Microarchitecture Summary
Top Optimizations
Optimization Star
Group I Optimizations: Essential Optimizations
Memory Size and Alignment Issues
Use the 3DNow! PREFETCH and PREFETCHW Instructions
Select DirectPath Over VectorPath Instructions
Group II Optimizations: Secondary Optimizations
Load-Execute Instruction Usage
Take Advantage of Write Combining
Use 3DNow! Instructions
Avoid Branches Dependent on Random Data
Avoid Placing Code and Data in the Same 64-Byte Cache Line
C Source Level Optimizations
Ensure Floating-Point Variables and Expressions are of Type Float
Use 32-Bit Data Types for Integer Code
Consider the Sign of Integer Operands
Use Array Style Instead of Pointer Style Code
Example 2 (Preferred):
Completely Unroll Small Loops
Avoid Unnecessary Store-to-Load Dependencies
Consider Expression Order in Compound Branch Conditions
Switch Statement Usage
Optimize Switch Statements
Use Prototypes for All Functions
Use Const Type Qualifier
Generic Loop Hoisting
Generalization for Multiple Constant Control Code
The above loop should be transformed into:
Declare Local Functions as Static
Dynamic Memory Allocation Consideration
Introduce Explicit Parallelism into Code
Explicitly Extract Common Subexpressions
C Language Structure Component Considerations
Sort Local Variables According to Base Type Size
Accelerating Floating-Point Divides and Square Roots
Avoid Unnecessary Integer Division
Copy Frequently De-referenced Pointer Arguments to Local Variables
Example 1 (Avoid):
Example 2 (Preferred):
Instruction Decoding
Overview
Select DirectPath Over VectorPath Instructions
Load-Execute Instruction Usage
Use Load-Execute Integer Instructions
Use Load-Execute Floating-Point Instructions with Floating-Point Operands
Avoid Load-Execute Floating-Point Instructions with Integer Operands
Align Branch Targets in Program Hot Spots
Use Short Instruction Lengths
Avoid Partial Register Reads and Writes
Replace Certain SHLD Instructions with Alternative Code
Use 8-Bit Sign-Extended Immediates
Use 8-Bit Sign-Extended Displacements
Code Padding Using Neutral Code Fillers
Recommendations for the AMD Athlon Processor
Recommendations for AMD-K6 Family and AMD Athlon Processor Blended Code
Cache and Memory
Memory Size and Alignment Issues
Avoid Memory Size Mismatches
Align Data Where Possible
Use the 3DNow! PREFETCH and PREFETCHW Instructions
Take Advantage of Write Combining
Avoid Placing Code and Data in the Same 64-Byte Cache Line
Store-to-Load Forwarding Restrictions
Store-to-Load Forwarding Pitfalls: True Dependencies
Summary of Store-to-Load Forwarding Pitfalls to Avoid
Stack Alignment Considerations
Align TBYTE Variables on Quadword Aligned Addresses
C Language Structure Component Considerations
Sort Variables According to Base Type Size
Branch Optimizations
Avoid Branches Dependent on Random Data
AMD Athlon Processor Specific Code
Blended AMD-K6 and AMD Athlon Processor Code
Always Pair CALL and RETURN
Replace Branches with Computation in 3DNow! Code
Muxing Constructs
Sample Code Translated into 3DNow! Code
Example 2: C code:
3DNow! code:
Example 3: C code:
3DNow! code:
Example 4: C code:
3DNow! code:
Example 5: C code:
3DNow! code:
Avoid the Loop Instruction
Avoid Far Control Transfer Instructions
Avoid Recursive Functions
Scheduling Optimizations
Schedule Instructions According to their Latency
Unrolling Loops
Complete Loop Unrolling
Partial Loop Unrolling
Use Function Inlining
Overview
Always Inline Functions if Called from One Site
Always Inline Functions with Fewer than 25 Machine Instructions
Avoid Address Generation Interlocks
Use MOVZX and MOVSX
Minimize Pointer Arithmetic in Loops
Push Memory Data Carefully
Integer Optimizations
Replace Divides with Multiplies
Multiplication by Reciprocal (Division) Utility
Unsigned Division by Multiplication of Constant
Signed Division by Multiplication of Constant
Use Alternative Code When Multiplying by a Constant
Use MMX Instructions for Integer-Only Work
Repeated String Instruction Usage
Latency of Repeated String Instructions
Guidelines for Repeated String Instructions
Use XOR Instruction to Clear Integer Registers
Efficient 64-Bit Integer Arithmetic
Example 4 (Left shift):
Example 5 (Right shift):
Example 6 (Multiplication):
Example 7 (Division):
Example 8 (Remainder):
Efficient Implementation of Population Count Function
Derivation of Multiplier Used for Integer Division by Constants
Unsigned Derivation for Algorithm, Multiplier, and Shift Factor
Signed Derivation for Algorithm, Multiplier, and Shift Factor
The utility sdiv.exe was compiled using the following code.
Floating-Point Optimizations
Ensure All FPU Data is Aligned
Use Multiplies Rather than Divides
Use FFREEP Macro to Pop One Register from the FPU Stack
Floating-Point Compare Instructions
Use the FXCH Instruction Rather than FST/FLD Pairs
Avoid Using Extended-Precision Data
Minimize Floating-Point-to-Integer Conversions
Floating-Point Subexpression Elimination
Check Argument Range of Trigonometric Instructions Efficiently
Take Advantage of the FSINCOS Instruction
3DNow! and MMX
Use 3DNow! Instructions
Use FEMMS Instruction
Use 3DNow! Instructions for Fast Division
Optimized 14-Bit Precision Divide
Optimized Full 24-Bit Precision Divide
Pipelined Pair of 24-Bit Precision Divides
Newton-Raphson Reciprocal
Use 3DNow! Instructions for Fast Square Root and Reciprocal Square Root
Optimized 15-Bit Precision Square Root
Optimized 24-Bit Precision Square Root
Newton-Raphson Reciprocal Square Root
Use MMX PMADDWD Instruction to Perform Two 32-Bit Multiplies in Parallel
3DNow! and MMX Intra-Operand Swapping
Fast Conversion of Signed Words to Floating-Point
Use MMX PXOR to Negate 3DNow! Data
Use MMX PCMP Instead of 3DNow! PFCMP
Use MMX Instructions for Block Copies and Block Fills
Use MMX PXOR to Clear All Bits in an MMX Register
Use MMX PCMPEQD to Set All Bits in an MMX Register
Use MMX PAND to Find Absolute Value in 3DNow! Code
Optimized Matrix Multiplication
Efficient 3D-Clipping Code Computation Using 3DNow! Instructions
Use 3DNow! PAVGUSB for MPEG-2 Motion Compensation
Example 1 (Avoid):
Stream of Packed Unsigned Bytes
Complex Number Arithmetic
General x86 Optimization Guidelines
Short Forms
Dependencies
Register Operands
Stack Allocation
Appendix A
AMD Athlon Processor Microarchitecture
Superscalar Processor
Instruction Cache
Predecode
Branch Prediction
Early Decoding
Instruction Control Unit
Data Cache
Integer Scheduler
Integer Execution Unit
MacroOPs
Instruction Control Unit and Register Files
Integer Scheduler (18-entry)
Floating-Point Scheduler
Floating-Point Execution Unit
Load-Store Unit (LSU)
Data Cache (2-way, 64 Kbytes)
LSU
Operand Buses
L2 Cache Controller
Write Combining
AMD Athlon System Bus
Appendix B
Pipeline and Execution Unit Resources Overview
Fetch and Decode Pipeline Stages
[Figure: fetch and decode pipeline stages 1 through 6: FETCH, SCAN, ALIGN1/MECTL, ALIGN2/MEROM, EDEC/MEDEC, IDEC]
Integer Pipeline Stages
Instruction Control Unit and Register Files
MacroOPs
Integer Scheduler (18-entry)
Floating-Point Pipeline Stages
Execution Unit Resources
Terminology
Integer Pipeline Operations
Floating-Point Pipeline Operations
Load/Store Pipeline Operations
Code Sample Analysis
Table 7. Sample 1 Integer Register Operations
Table 8. Sample 2 Integer Register and Memory Load Operations
Appendix C
Implementation of Write Combining
Write-Combining Definitions and Abbreviations
What is Write Combining?
Programming Details
Write-Combining Operations
Table 9. Write Combining Completion Events
Sending Write-Buffer Data to the System
Appendix D
Performance-Monitoring Counters
Overview
Performance Counter Usage
PerfEvtSel[3:0] MSRs (MSR Addresses C001_0000h-C001_0003h)
Table 11. Performance-Monitoring Counters
PerfCtr[3:0] MSRs (MSR Addresses C001_0004h-C001_0007h)
Starting and Stopping the Performance-Monitoring Counters
Event and Time-Stamp Monitoring Software
Monitoring Counter Overflow
Appendix E
Programming the MTRR and PAT
Memory Type Range Register (MTRR) Mechanism
Page Attribute Table (PAT)
Table 15. Effective Memory Type Based on PAT and MTRRs
Table 16. Final Output Memory Types
MTRR Fixed-Range Register Format The memory types defined for memory segments defined in
Appendix F
Instruction Dispatch and Execution Resources
AMD Athlon Processor x86 Code Optimization 22007E/0 November 1999
Table 20. MMX Instructions
Table 21. MMX Extensions
Table 22. Floating-Point Instructions
Table 23. 3DNow! Instructions
Table 24. 3DNow! Extensions
Appendix G
DirectPath versus VectorPath Instructions
Select DirectPath Over VectorPath Instructions
DirectPath Instructions
Table 25. DirectPath Integer Instructions
Table 26. DirectPath MMX Instructions
Table 27. DirectPath MMX Extensions
Table 28. DirectPath Floating-Point Instructions
VectorPath Instructions
Table 29. VectorPath Integer Instructions
Table 30. VectorPath MMX Instructions
Table 31. VectorPath MMX Extensions
Table 32. VectorPath Floating-Point Instructions
Index
Numerics
A
B
C
L
M
N
O
P