AMD AthlonTM Processor
Trademarks
Contents
Instruction Decoding Optimizations
Cache and Memory Optimizations
Scheduling Optimizations
Floating-Point Optimizations
General x86 Optimization Guidelines 127
Appendix B Pipeline and Execution Unit Resources Overview
Appendix E Programming the Mtrr and PAT
List of Figures
Xii
List of Tables
Xiv
Revision History
Xvi
Introduction
About this Document
Source Level Optimizations. Describes optimizations that
AMD Athlon Processor Family
AMD Athlon Processor Microarchitecture Summary
AMD Athlon Processor Microarchitecture Summary
AMD Athlon Processor x86 Code Optimization
Top Optimizations
Group I Optimizations Essential Optimizations
Memory Size and Alignment Issues
Optimization Star
Use the 3DNow! Prefetch and Prefetchw Instructions
Load-Execute Instruction Usage
Group II Optimizations-Secondary Optimizations
Select DirectPath Over VectorPath Instructions
Use Load-Execute Instructions
Avoid Branches Dependent on Random Data
Take Advantage of Write Combining
Use 3DNow! Instructions
Avoid Placing Code and Data in the Same 64-Byte Cache Line
22007E/0 November
Source Level Optimizations
Use 32-Bit Data Types for Integer Code
Example Preferred
Consider the Sign of Integer Operands
Example 1 Avoid
Use unsigned types for
Use Array Style Instead of Pointer Style Code
Example Avoid
Use signed types for
Vertex
Example 2 Preferred
Completely Unroll Small Loops
Avoid Unnecessary Store-to-Load Dependencies
Avoid Unnecessary Store-to-Load Dependencies
Consider Expression Order in Compound Branch Conditions
Use Prototypes for All Functions
Switch Statement Usage
Optimize Switch Statements
Example
Use Const Type Qualifier
Generic Loop Hoisting
Generalization for Multiple Constant Control Code
Declare Local Functions as Static
Dynamic Memory Allocation Consideration
Introduce Explicit Parallelism into Code
Explicitly Extract Common Subexpressions
Avoid
Language Structure Component Considerations
Example
Preferred
New ordering, with padding Preferred
Sort Local Variables According to Base Type Size
Original ordering Avoid
Accelerating Floating-Point Divides and Square Roots
Improved ordering Preferred
Accelerating Floating-Point Divides and Square Roots
Avoid Unnecessary Integer Division
AMD Athlon Processor x86 Code Optimization
Instruction Decoding Optimizations
Overview
Use Load-Execute Integer Instructions
Select DirectPath Over VectorPath Instructions
Load-Execute Instruction Usage
TOP
Align Branch Targets in Program Hot Spots
Use Short Instruction Lengths
Avoid Partial Register Reads and Writes
Example 2 Avoid
Use 8-Bit Sign-Extended Immediates
Replace Certain Shld Instructions with Alternative Code
Use 8-Bit Sign-Extended Displacements
Code Padding Using Neutral Code Fillers
Recommendations for the AMD Athlon Processor
Code Padding Using Neutral Code Fillers
AMD Athlon Processor x86 Code Optimization
NOP6EDI
AMD Athlon Processor x86 Code Optimization
Avoid Memory Size Mismatches
Memory Size and Alignment Issues
Cache and Memory Optimizations
Align Data Where Possible
Use the 3DNow! Prefetch and Prefetchw Instructions
Example Multiple Prefetches
Code
MOV ECX, -LARGENUM
Determining Prefetch Distance
Prefetch Distance = 200 DS/C bytes
Take Advantage of Write Combining
Avoid Placing Code and Data in the Same 64-Byte Cache Line
Store-to-Load Forwarding Restrictions
Store-to-Load Forwarding Pitfalls -True Dependencies
Example 3 Avoid
Example 4 Avoid
Example 7 Avoid
Example 5 Preferred
Example 6 Avoid
Stack Alignment Considerations
Summary of Store-to-Load Forwarding Pitfalls to Avoid
Align Tbyte Variables on Quadword Aligned Addresses
Sort Variables According to Base Type Size
Branch Optimizations
Avoid Branches Dependent on Random Data
AMD Athlon Processor Specific Code
Blended AMD-K6and AMD Athlon Processor Code
Example 7 Integer Signum Function
Always Pair Call and Return
Example 6 Increment Ring Buffer Offset
Replace Branches with Computation in 3DNow! Code
Muxing Constructs
3DNow! code
Sample Code Translated into 3DNow! Code
Code
MM5
Pfsub
Psrad
Avoid the Loop Instruction
Avoid Far Control Transfer Instructions
Avoid Recursive Functions
Unrolling Loops
Scheduling Optimizations
Schedule Instructions According to their Latency
Complete Loop Unrolling
Partial Loop Unrolling
Without Loop Unrolling
With Partial Loop Unrolling
Unrolled Loops
Deriving Loop
Control For Partially
Example 1 rolled loop
Use Function Inlining
Overview
Avoid Address Generation Interlocks
Always Inline Functions if Called from One Site
Use Movzx and Movsx
Minimize Pointer Arithmetic in Loops
MOV ECX, Maxsize
Push Memory Data Carefully
Example 3 Preferred
Push Memory Data Carefully
Multiplication by Reciprocal Division Utility
Integer Optimizations
Replace Divides with Multiplies
Unsigned Division by Multiplication of Constant
Restricted Dividend
Signed Division by Multiplication of Constant
Simpler Code for
Remainder of Signed
Signed Division By
Signed Division By 2n
Integer 2 or
Use Alternative Code When Multiplying by a Constant
Integer 2n or -2n
ADD REG1, REG1 REG1, REG2 SHL
Use MMX Instructions for Integer-Only Work
Guidelines for Repeated String Instructions
Repeated String Instruction Usage
Latency of Repeated String Instructions
Align Source
Using Movq
Ensure DF=0 UP
Destination with
Use XOR Instruction to Clear Integer Registers
Efficient 64-Bit Integer Arithmetic
Example 6 Multiplication
Example 4 Left shift
Example 5 Right shift
Example 7 Division
EBX, ESP+12 Dividendlo
Example 8 Remainder
SHR EDX
Efficient Implementation of Population Count Function
Step
Bit field. Thus the following computation is performed
MOV EDX, EDX SHR
Derivation of Multiplier Used for Integer Division by
Utility sdiv.exe was compiled using the following code
MOV ECX EDX Imul ADD SHR SAR
Use Multiplies Rather than Divides
Floating-Point Optimizations
Ensure All FPU Data is Aligned
Floating-Point Compare Instructions
Use Ffreep Macro to Pop One Register from the FPU Stack
Avoid Using Extended-Precision Data
Use the Fxch Instruction Rather than FST/FLD Pairs
Minimize Floating-Point-to-Integer Conversions
Example 1 Fast
Example 2 Potentially faster
Example 3 Potentially faster
Example 4 Fastest
Floating-Point Subexpression Elimination
104
Take Advantage of the Fsincos Instruction
106
Use Femms Instruction
3DNow! and MMX Optimizations
Use 3DNow! Instructions
Optimized Full 24-Bit Precision Divide
Use 3DNow! Instructions for Fast Division
Optimized 14-Bit Precision Divide
Pipelined Pair of 24-Bit Precision Divides
Newton-Raphson Reciprocal
Optimized 15-Bit Precision Square Root
Optimized 24-Bit Precision Square Root
Newton-Raphson Reciprocal Square Root
Specific Code
3DNow! and MMX Intra-Operand Swapping
AMD Athlon
Blended Code
Fast Conversion of Signed Words to Floating-Point
Use MMX Pxor to Negate 3DNow! Data
Positive One Negative, One
Use MMX Pcmp Instead of 3DNow! Pfcmp
Both Numbers
Positive
Use MMX Instructions for Block Copies and Block Fills
AMD-K6and AMD Athlon Processor Blended Code
116
Processor Specific
Use MMX Pxor to Clear All Bits in an MMX Register
Optimized Matrix Multiplication
Use MMX Pcmpeqd to Set All Bits in an MMX Register
MOV EBX, RES
Optimized Matrix Multiplication 121
Data
Use 3DNow! Pavgusb for MPEG-2 Motion Compensation
MM0=QWORD1
Stream of Packed Unsigned Bytes
Complex Number Arithmetic
General x86 Optimization Guidelines
Short Forms
Stack Allocation
Dependencies
Register Operands
AMD Athlon Processor Microarchitecture
Introduction
AMD Athlon Processor Microarchitecture
Superscalar Processor
AMD Athlon Processor Microarchitecture 131
Predecode
Branch Prediction
Early Decoding
Instruction Control Unit
Data Cache
Integer Scheduler
Integer Execution Unit
Floating-Point Scheduler
Floating-Point Execution Unit
12 to
Load-Store Unit LSU
Load/Store Unit
AMD Athlon System Bus
L2 Cache Controller
Write Combining
140 AMD Athlon Processor Microarchitecture
Pipeline and Execution Unit Resources Overview
Fetch and Decode Pipeline Stages
C T L R O M
Cycle 3 DirectPath
Cycle 1-FETCH
Cycle 2-SCAN
Cycle 3 VectorPath
Integer Pipeline Stages
Integer Pipeline Stages
Cycle 9-ADDGEN
Cycle 7-SCHED
Cycle 8-EXEC
Cycle 10 -DCACC
Floating-Point Pipeline Stages
Floating-Point Pipeline Stages
Cycle 9-SCHEDW
Cycle 7-STKREN
Cycle 8-REGREN
Cycle 10 -SCHED
Operands
Execution Unit Resources
Terminology
Results
Integer Decode Types
Integer Pipeline Operations
Integer Pipeline Operation Types
Floating-Point Decode Types
Floating-Point Pipeline Operations
Floating-Point Pipeline Operation Types
Stage 1 Cycle Stage 2 Cycle Stage 3 Cycle
Load/Store Pipeline Operations
Load/Store Unit Stages
Code Sample Analysis
INC EBX ADD ESI, EDX
Imul EAX, ECX INC ESI MOV
ADD EDI, EBX SHL
SAR
Sample 2 Integer Register and Memory Load Operations
DEC EDX MOV EDI, ECX SUB
Appendix C
Implementation Write Combining
Programming Details
What is Write Combining?
Write-Combining Definitions and Abbreviations
Write-Combining Operations
Write Combining Completion Events
INIT, Halt
AMD Athlon System Bus Commands Generation Rules
Sending Write-Buffer Data to the System
160 Write-Combining Operations
Performance-Monitoring Counters
Performance Counter Usage
PerfEvtSel30 MSRs MSR Addresses C0010000h-C0010003h
PerfEvtSel30 Registers
Performance Counter Usage 163
Performance-Monitoring Counters
Snoop hits
65h
73h
74h
Instruction cache fetches
Event Source Event Description
Waited to use the L2
Instruction cache misses
PerfCtr30 MSRs MSR Addresses C0010004h-C0010007h
Event and Time-Stamp Monitoring Software
Starting and Stopping the Performance-Monitoring Counters
Monitoring Counter Overflow
170 Monitoring Counter Overflow
Programming the Mtrr
Memory Type Range Register Mtrr Mechanism
172 Memory Type Range Register Mtrr Mechanism
100000h Kbytes each Fixed Ranges C0000h 80000h
Fixed Ranges
FFFFFFFFh
Mtrr Capability
Memory Types
Memory Type Encodings
Register Format
Memory Type Range Register Mtrr Mechanism 175
Standard Mtrr Types and Properties
Attribute Table PAT
Attribute Table MSR 277h
MTRRs and PAT
PAT Entry Reset Value
PATi 3-Bit Encodings
PATi
Effective Memory Type Based on PAT and MTRRs
PAT Memory Type Mtrr Memory Type
Final Output Memory Types
Input Memory Type
Attribute Table PAT 181
Bffff Bbfff B7FFF B3FFF Affff Abfff A7FFF A3FFF
7FFFF 6FFFF 5FFFF 4FFFF 3FFFF 2FFFF 1FFFF 0FFFF
9FFFF 9BFFF 97FFF 93FFF 8FFFF 8BFFF 87FFF 83FFF
C7FFF C6FFF C5FFF C4FFF C3FFF C2FFF C1FFF C0FFF
Attribute Table PAT 183
MTRRphysMaskn Register Format
MTRR-Related Model-Specific Register MSR Map
186
Appendix F
Instruction Dispatch Execution Resources
AAD
Integer Instructions
AAA
AAM
ADC mem8, reg8
ModR/M Decode Byte
ADC mreg8, reg8
ADC mreg16/32, reg16/32
Bswap ECX
Bound
Bswap EAX
Bswap EDX
CLD
CBW/CWDE
CLC
CLI
CMOVG/CMOVNLE reg16/32, reg16/32 0Fh
CMOVE/CMOVZ reg16/32, reg16/32 0Fh
CMOVE/CMOVZ reg16/32, mem16/32 0Fh
CMOVG/CMOVNLE reg16/32, mem16/32 0Fh
DAA
Cpuid
CWD/CDQ
DAS
AX, DX
Enter
AL, DX
EAX, DX
Invd
Invlpg
Lahf
Leave
LSL reg16/32, mreg16/32 0Fh 03h
LOOPE/LOOPZ disp8 E1h
LOOPNE/LOOPNZ disp8 E0h
LSL reg16/32, mem16/32 0Fh 03h
NOP Xchg EAX, EAX
OUT DX, EAX
OUT DX, AL
OUT DX, AX
POP ES
POP EBP
POP EBX
POP ESP
POP ESI
Rdtsc
Rdmsr
Rdpmc
Sahf
SBB reg8, mreg8
SBB mreg16/32, reg16/32
SBB mem16/32, reg16/32
SBB reg8, mem8
Setns mreg8
Sets mreg8
Sets mem8
Setns mem8
STI
STC
STD
Sysexit
Syscall
Sysenter
Sysret
Xchg EAX, EDX
Xchg EAX, EAX
Xchg EAX, ECX
Xchg EAX, EBX
MMX Instructions
Emms
Pandn mmreg, mem64
Pandn mmreg1, mmreg2
DFh
Pcmpeqb mmreg1, mmreg2
FADD/FMUL
MMX Extensions
FPU
Floating-Point Instructions
Fdecstp
Fcompp
Fcos
FLD1
Fincstp
Finit
FLDLG2
FLDL2E
FLDL2T
FLDLN2
Fucom
Fstsw AX
Ftst
Fucomp
DNow! Instructions
Femms
DNow! Extensions
DirectPath versus VectorPath Instructions
DirectPath Instructions
CBW/CWDE CLC CMC
DirectPath Integer Instructions
BT mreg16/32, reg16/32 BT mreg16/32, imm8 BT mem16/32, imm8
DEC mreg8 DEC mem8 DEC mreg16/32 DEC mem16/32
INC mreg8 INC mem8 INC mreg16/32 INC mem16/32 JO short disp8
222 DirectPath Instructions
DirectPath Instructions 223
224 DirectPath Instructions
Wait Xchg EAX, EAX
226 DirectPath Instructions
DirectPath MMX Instructions
DirectPath MMX Extensions
Fist mem16int
DirectPath Floating-Point Instructions
Fcompp Fdecstp
FLD1 FLDL2E FLDL2T FLDLG2 FLDLN2 Fldpi Fldz
Ftst Fucom Fucomp Fucompp Fwait Fxch
AAA AAD AAM AAS
VectorPath Instructions
VectorPath Integer Instructions
CLD CLI Clts
Instruction Mnemonic DIV EAX, mem16/32
AL, DX AX, DX EAX, DX Invd Invlpg
PUSHA/PUSHAD PUSHF/PUSHFD
POP mreg 16/32 POP mem 16/32
Push mreg16/32 Push mem16/32
Rdmsr Rdpmc Rdtsc
Syscall Sysenter Sysexit Sysret
VectorPath MMX Instructions
VectorPath MMX Extensions
Wbinvd Wrmsr
Fscale Fsin Fsincos
VectorPath Floating-Point Instructions
Fptan Fpatan Frndint
Fxam Fxtract FYL2X FYL2XP1
236
Index
238 Index
Index 239
240 Index