AMD AthlonTM Processor
 Trademarks
 Contents
 Instruction Decoding Optimizations
 Cache and Memory Optimizations
 Scheduling Optimizations
 Floating-Point Optimizations
 General x86 Optimization Guidelines 127
 Appendix B Pipeline and Execution Unit Resources Overview
 Appendix E Programming the Mtrr and PAT
 List of Figures
 Xii
 List of Tables
 Xiv
 Revision History
 Xvi
 Introduction
About this Document
 Source Level Optimizations. Describes optimizations that
 AMD Athlon Processor Family
 AMD Athlon Processor Microarchitecture Summary
 AMD Athlon Processor Microarchitecture Summary
 AMD Athlon Processor x86 Code Optimization
 Top Optimizations
 Group I Optimizations Essential Optimizations
Memory Size and Alignment Issues
Optimization Star
Use the 3DNow! Prefetch and Prefetchw Instructions
 Load-Execute Instruction Usage
Group II Optimizations-Secondary Optimizations
Select DirectPath Over VectorPath Instructions
Use Load-Execute Instructions
 Use 3DNow! Instructions
Take Advantage of Write Combining
Avoid Branches Dependent on Random Data
 Avoid Placing Code and Data in the Same 64-Byte Cache Line
 22007E/0 November
 Source Level Optimizations
Use 32-Bit Data Types for Integer Code
 Example 1 Avoid
Consider the Sign of Integer Operands
Example Preferred
 Use unsigned types for
Use Array Style Instead of Pointer Style Code
Example Avoid
Use signed types for
 Vertex
 Example 2 Preferred
 Completely Unroll Small Loops
Avoid Unnecessary Store-to-Load Dependencies
 Avoid Unnecessary Store-to-Load Dependencies
 Consider Expression Order in Compound Branch Conditions
 Optimize Switch Statements
Switch Statement Usage
Use Prototypes for All Functions
 Generic Loop Hoisting
Use Const Type Qualifier
Example
 Generalization for Multiple Constant Control Code
 Declare Local Functions as Static
 Dynamic Memory Allocation Consideration
Introduce Explicit Parallelism into Code
 Explicitly Extract Common Subexpressions
 Avoid
Language Structure Component Considerations
Example
Preferred
 Original ordering Avoid
Sort Local Variables According to Base Type Size
New ordering, with padding Preferred
 Accelerating Floating-Point Divides and Square Roots
Improved ordering Preferred
 Accelerating Floating-Point Divides and Square Roots
 Avoid Unnecessary Integer Division
 AMD Athlon Processor x86 Code Optimization
 Instruction Decoding Optimizations
Overview
 Load-Execute Instruction Usage
Select DirectPath Over VectorPath Instructions
Use Load-Execute Integer Instructions
 TOP
 Align Branch Targets in Program Hot Spots
Use Short Instruction Lengths
 Avoid Partial Register Reads and Writes
Example 2 Avoid
 Use 8-Bit Sign-Extended Immediates
Replace Certain Shld Instructions with Alternative Code
 Use 8-Bit Sign-Extended Displacements
Code Padding Using Neutral Code Fillers
 Recommendations for the AMD Athlon Processor
 Code Padding Using Neutral Code Fillers
 AMD Athlon Processor x86 Code Optimization
 NOP6EDI
 AMD Athlon Processor x86 Code Optimization
 Cache and Memory Optimizations
Memory Size and Alignment Issues
Avoid Memory Size Mismatches
 Align Data Where Possible
Use the 3DNow! Prefetch and Prefetchw Instructions
 Example Multiple Prefetches
Code
 MOV ECX, -LARGENUM
 Determining Prefetch Distance
Prefetch Distance = 200 DS/C bytes
 Take Advantage of Write Combining
Avoid Placing Code and Data in the Same 64-Byte Cache Line
 Store-to-Load Forwarding Restrictions
Store-to-Load Forwarding Pitfalls -True Dependencies
 Example 3 Avoid
Example 4 Avoid
 Example 6 Avoid
Example 5 Preferred
Example 7 Avoid
 Stack Alignment Considerations
Summary of Store-to-Load Forwarding Pitfalls to Avoid
 Align Tbyte Variables on Quadword Aligned Addresses
 Sort Variables According to Base Type Size
 Branch Optimizations
Avoid Branches Dependent on Random Data
 AMD Athlon Processor Specific Code
Blended AMD-K6and AMD Athlon Processor Code
 Example 6 Increment Ring Buffer Offset
Always Pair Call and Return
Example 7 Integer Signum Function
 Replace Branches with Computation in 3DNow! Code
Muxing Constructs
 Code
Sample Code Translated into 3DNow! Code
3DNow! code
 MM5
 Pfsub
 Psrad
 Avoid the Loop Instruction
Avoid Far Control Transfer Instructions
 Avoid Recursive Functions
 Unrolling Loops
Scheduling Optimizations
Schedule Instructions According to their Latency
Complete Loop Unrolling
 Partial Loop Unrolling
 Without Loop Unrolling
With Partial Loop Unrolling
 Unrolled Loops
Deriving Loop
Control For Partially
Example 1 rolled loop
 Use Function Inlining
Overview
 Avoid Address Generation Interlocks
Always Inline Functions if Called from One Site
 Use Movzx and Movsx
Minimize Pointer Arithmetic in Loops
 MOV ECX, Maxsize
 Push Memory Data Carefully
Example 3 Preferred
 Push Memory Data Carefully
 Replace Divides with Multiplies
Integer Optimizations
Multiplication by Reciprocal Division Utility
 Unsigned Division by Multiplication of Constant
 Simpler Code for
Signed Division by Multiplication of Constant
Restricted Dividend
 Remainder of Signed
Signed Division By
Signed Division By 2n
Integer 2 or
 Use Alternative Code When Multiplying by a Constant
Integer 2n or -2n
 ADD REG1, REG1 REG1, REG2 SHL
 Use MMX Instructions for Integer-Only Work
 Latency of Repeated String Instructions
Repeated String Instruction Usage
Guidelines for Repeated String Instructions
 Align Source
Using Movq
Ensure DF=0 UP
Destination with
 Use XOR Instruction to Clear Integer Registers
Efficient 64-Bit Integer Arithmetic
 Example 5 Right shift
Example 4 Left shift
Example 6 Multiplication
 Example 7 Division
EBX, ESP+12 Dividendlo
 Example 8 Remainder
 SHR EDX
 Efficient Implementation of Population Count Function
Step
 Bit field. Thus the following computation is performed
 MOV EDX, EDX SHR
 Derivation of Multiplier Used for Integer Division by
 Utility sdiv.exe was compiled using the following code
 MOV ECX EDX Imul ADD SHR SAR
 Ensure All FPU Data is Aligned
Floating-Point Optimizations
Use Multiplies Rather than Divides
 Floating-Point Compare Instructions
Use Ffreep Macro to Pop One Register from the FPU Stack
 Avoid Using Extended-Precision Data
Use the Fxch Instruction Rather than FST/FLD Pairs
 Minimize Floating-Point-to-Integer Conversions
Example 1 Fast
 Example 2 Potentially faster
 Example 3 Potentially faster
Example 4 Fastest
 Floating-Point Subexpression Elimination
 104
 Take Advantage of the Fsincos Instruction
 106
 Use 3DNow! Instructions
3DNow! and MMX Optimizations
Use Femms Instruction
 Optimized 14-Bit Precision Divide
Use 3DNow! Instructions for Fast Division
Optimized Full 24-Bit Precision Divide
 Pipelined Pair of 24-Bit Precision Divides
Newton-Raphson Reciprocal
 Optimized 15-Bit Precision Square Root
Optimized 24-Bit Precision Square Root
 Newton-Raphson Reciprocal Square Root
 Specific Code
3DNow! and MMX Intra-Operand Swapping
AMD Athlon
Blended Code
 Fast Conversion of Signed Words to Floating-Point
Use MMX Pxor to Negate 3DNow! Data
 Positive One Negative, One
Use MMX Pcmp Instead of 3DNow! Pfcmp
Both Numbers
Positive
 Use MMX Instructions for Block Copies and Block Fills
AMD-K6and AMD Athlon Processor Blended Code
 116
 Processor Specific
 Use MMX Pxor to Clear All Bits in an MMX Register
 Optimized Matrix Multiplication
Use MMX Pcmpeqd to Set All Bits in an MMX Register
 MOV EBX, RES
 Optimized Matrix Multiplication 121
 Data
 Use 3DNow! Pavgusb for MPEG-2 Motion Compensation
 MM0=QWORD1
 Stream of Packed Unsigned Bytes
 Complex Number Arithmetic
 General x86 Optimization Guidelines
Short Forms
 Register Operands
Dependencies
Stack Allocation
 AMD Athlon Processor Microarchitecture
Introduction
 AMD Athlon Processor Microarchitecture
Superscalar Processor
 AMD Athlon Processor Microarchitecture 131
 Predecode
Branch Prediction
 Early Decoding
 Instruction Control Unit
Data Cache
 Integer Scheduler
Integer Execution Unit
 Floating-Point Scheduler
 Floating-Point Execution Unit
12 to
 Load-Store Unit LSU
Load/Store Unit
 Write Combining
L2 Cache Controller
AMD Athlon System Bus
 140 AMD Athlon Processor Microarchitecture
 Pipeline and Execution Unit Resources Overview
Fetch and Decode Pipeline Stages
 C T L R O M
 Cycle 3 DirectPath
Cycle 1-FETCH
Cycle 2-SCAN
Cycle 3 VectorPath
 Integer Pipeline Stages
Integer Pipeline Stages
 Cycle 9-ADDGEN
Cycle 7-SCHED
Cycle 8-EXEC
Cycle 10 -DCACC
 Floating-Point Pipeline Stages
Floating-Point Pipeline Stages
 Cycle 9-SCHEDW
Cycle 7-STKREN
Cycle 8-REGREN
Cycle 10 -SCHED
 Operands
Execution Unit Resources
Terminology
Results
 Integer Pipeline Operation Types
Integer Pipeline Operations
Integer Decode Types
 Floating-Point Pipeline Operation Types
Floating-Point Pipeline Operations
Floating-Point Decode Types
 Load/Store Unit Stages
Load/Store Pipeline Operations
Stage 1 Cycle Stage 2 Cycle Stage 3 Cycle
 Code Sample Analysis
 ADD EDI, EBX SHL
Imul EAX, ECX INC ESI MOV
INC EBX ADD ESI, EDX
 DEC EDX MOV EDI, ECX SUB
Sample 2 Integer Register and Memory Load Operations
SAR
 Appendix C
Implementation Write Combining
 Write-Combining Definitions and Abbreviations
What is Write Combining?
Programming Details
 Write-Combining Operations
 Write Combining Completion Events
INIT, Halt
 AMD Athlon System Bus Commands Generation Rules
Sending Write-Buffer Data to the System
 160 Write-Combining Operations
 Performance-Monitoring Counters
Performance Counter Usage
 PerfEvtSel30 MSRs MSR Addresses C0010000h-C0010003h
PerfEvtSel30 Registers
 Performance Counter Usage 163
 Performance-Monitoring Counters
 Snoop hits
65h
73h
74h
 Instruction cache fetches
Event Source Event Description
Waited to use the L2
Instruction cache misses
 PerfCtr30 MSRs MSR Addresses C0010004h-C0010007h
 Event and Time-Stamp Monitoring Software
Starting and Stopping the Performance-Monitoring Counters
 Monitoring Counter Overflow
 170 Monitoring Counter Overflow
 Programming the Mtrr
Memory Type Range Register Mtrr Mechanism
 172 Memory Type Range Register Mtrr Mechanism
 FFFFFFFFh
Fixed Ranges
100000h Kbytes each Fixed Ranges C0000h 80000h
 Mtrr Capability
Memory Types
Memory Type Encodings
Register Format
 Memory Type Range Register Mtrr Mechanism 175
 Standard Mtrr Types and Properties
 Attribute Table PAT
Attribute Table MSR 277h
 MTRRs and PAT
PAT Entry Reset Value
PATi 3-Bit Encodings
PATi
 Effective Memory Type Based on PAT and MTRRs
PAT Memory Type Mtrr Memory Type
 Final Output Memory Types
Input Memory Type
 Attribute Table PAT 181
 Bffff Bbfff B7FFF B3FFF Affff Abfff A7FFF A3FFF
7FFFF 6FFFF 5FFFF 4FFFF 3FFFF 2FFFF 1FFFF 0FFFF
9FFFF 9BFFF 97FFF 93FFF 8FFFF 8BFFF 87FFF 83FFF
C7FFF C6FFF C5FFF C4FFF C3FFF C2FFF C1FFF C0FFF
 Attribute Table PAT 183
 MTRRphysMaskn Register Format
 MTRR-Related Model-Specific Register MSR Map
 186
 Appendix F
Instruction Dispatch Execution Resources
 AAD
Integer Instructions
AAA
AAM
 ADC mem8, reg8
ModR/M Decode Byte
ADC mreg8, reg8
ADC mreg16/32, reg16/32
 Bswap ECX
Bound
Bswap EAX
Bswap EDX
 CLD
CBW/CWDE
CLC
CLI
 CMOVG/CMOVNLE reg16/32, reg16/32 0Fh
CMOVE/CMOVZ reg16/32, reg16/32 0Fh
CMOVE/CMOVZ reg16/32, mem16/32 0Fh
CMOVG/CMOVNLE reg16/32, mem16/32 0Fh
 DAA
Cpuid
CWD/CDQ
DAS
 AX, DX
Enter
AL, DX
EAX, DX
 Invd
Invlpg
 Lahf
Leave
 LSL reg16/32, mreg16/32 0Fh 03h
LOOPE/LOOPZ disp8 E1h
LOOPNE/LOOPNZ disp8 E0h
LSL reg16/32, mem16/32 0Fh 03h
 NOP Xchg EAX, EAX
 OUT DX, EAX
OUT DX, AL
OUT DX, AX
POP ES
 POP EBP
POP EBX
POP ESP
POP ESI
 Rdpmc
Rdmsr
Rdtsc
 Sahf
 SBB reg8, mreg8
SBB mreg16/32, reg16/32
SBB mem16/32, reg16/32
SBB reg8, mem8
 Setns mreg8
Sets mreg8
Sets mem8
Setns mem8
 STD
STC
STI
 Sysexit
Syscall
Sysenter
Sysret
 Xchg EAX, EDX
Xchg EAX, EAX
Xchg EAX, ECX
Xchg EAX, EBX
 MMX Instructions
Emms
 Pandn mmreg, mem64
Pandn mmreg1, mmreg2
DFh
Pcmpeqb mmreg1, mmreg2
 FADD/FMUL
 MMX Extensions
FPU
 Floating-Point Instructions
 Fcos
Fcompp
Fdecstp
 Finit
Fincstp
FLD1
 FLDLG2
FLDL2E
FLDL2T
FLDLN2
 Fucom
Fstsw AX
Ftst
Fucomp
 DNow! Instructions
Femms
 DNow! Extensions
 DirectPath versus VectorPath Instructions
DirectPath Instructions
 BT mreg16/32, reg16/32 BT mreg16/32, imm8 BT mem16/32, imm8
DirectPath Integer Instructions
CBW/CWDE CLC CMC
 DEC mreg8 DEC mem8 DEC mreg16/32 DEC mem16/32
INC mreg8 INC mem8 INC mreg16/32 INC mem16/32 JO short disp8
 222 DirectPath Instructions
 DirectPath Instructions 223
 224 DirectPath Instructions
 Wait Xchg EAX, EAX
 226 DirectPath Instructions
 DirectPath MMX Instructions
 DirectPath MMX Extensions
 Fist mem16int
DirectPath Floating-Point Instructions
Fcompp Fdecstp
FLD1 FLDL2E FLDL2T FLDLG2 FLDLN2 Fldpi Fldz
 Ftst Fucom Fucomp Fucompp Fwait Fxch
 AAA AAD AAM AAS
VectorPath Instructions
VectorPath Integer Instructions
CLD CLI Clts
 Instruction Mnemonic DIV EAX, mem16/32
AL, DX AX, DX EAX, DX Invd Invlpg
 PUSHA/PUSHAD PUSHF/PUSHFD
POP mreg 16/32 POP mem 16/32
Push mreg16/32 Push mem16/32
Rdmsr Rdpmc Rdtsc
 Syscall Sysenter Sysexit Sysret
VectorPath MMX Instructions
VectorPath MMX Extensions
Wbinvd Wrmsr
 Fscale Fsin Fsincos
VectorPath Floating-Point Instructions
Fptan Fpatan Frndint
Fxam Fxtract FYL2X FYL2XP1
 236
 Index
 238 Index
 Index 239
 240 Index