AMD Athlon™ Processor
 Trademarks
 Contents
 Instruction Decoding Optimizations
 Cache and Memory Optimizations
 Scheduling Optimizations
 Floating-Point Optimizations
General x86 Optimization Guidelines
 Appendix B Pipeline and Execution Unit Resources Overview
Appendix E Programming the MTRR and PAT
 List of Figures
 List of Tables
 Revision History
 About this Document
Introduction
 Source Level Optimizations. Describes optimizations that
 AMD Athlon Processor Family
 AMD Athlon Processor Microarchitecture Summary
 AMD Athlon Processor x86 Code Optimization
 Top Optimizations
Use the 3DNow! PREFETCH and PREFETCHW Instructions
Memory Size and Alignment Issues
Optimization Star
Group I Optimizations: Essential Optimizations
 Use Load-Execute Instructions
Group II Optimizations: Secondary Optimizations
Select DirectPath Over VectorPath Instructions
Load-Execute Instruction Usage
 Avoid Branches Dependent on Random Data
Take Advantage of Write Combining
Use 3DNow! Instructions
 Avoid Placing Code and Data in the Same 64-Byte Cache Line
 22007E/0 November
 Use 32-Bit Data Types for Integer Code
Source Level Optimizations
 Example Preferred
Consider the Sign of Integer Operands
Example 1 Avoid
 Use signed types for
Use Array Style Instead of Pointer Style Code
Example Avoid
Use unsigned types for
 Vertex
 Example 2 Preferred
 Avoid Unnecessary Store-to-Load Dependencies
Completely Unroll Small Loops
 Consider Expression Order in Compound Branch Conditions
 Use Prototypes for All Functions
Switch Statement Usage
Optimize Switch Statements
 Example
Use Const Type Qualifier
Generic Loop Hoisting
 Generalization for Multiple Constant Control Code
 Declare Local Functions as Static
 Introduce Explicit Parallelism into Code
Dynamic Memory Allocation Consideration
 Explicitly Extract Common Subexpressions
 Preferred
Language Structure Component Considerations
Example
Avoid
 New ordering, with padding Preferred
Sort Local Variables According to Base Type Size
Original ordering Avoid
 Improved ordering Preferred
Accelerating Floating-Point Divides and Square Roots
 Avoid Unnecessary Integer Division
 Overview
Instruction Decoding Optimizations
 Use Load-Execute Integer Instructions
Select DirectPath Over VectorPath Instructions
Load-Execute Instruction Usage
 TOP
 Use Short Instruction Lengths
Align Branch Targets in Program Hot Spots
 Example 2 Avoid
Avoid Partial Register Reads and Writes
Replace Certain SHLD Instructions with Alternative Code
Use 8-Bit Sign-Extended Immediates
 Code Padding Using Neutral Code Fillers
Use 8-Bit Sign-Extended Displacements
 Recommendations for the AMD Athlon Processor
 NOP6EDI
 Avoid Memory Size Mismatches
Memory Size and Alignment Issues
Cache and Memory Optimizations
Use the 3DNow! PREFETCH and PREFETCHW Instructions
Align Data Where Possible
 Code
Example Multiple Prefetches
 MOV ECX, -LARGENUM
Prefetch Distance = 200 (DS/C) bytes
Determining Prefetch Distance
 Avoid Placing Code and Data in the Same 64-Byte Cache Line
Take Advantage of Write Combining
 Store-to-Load Forwarding Pitfalls -True Dependencies
Store-to-Load Forwarding Restrictions
 Example 4 Avoid
Example 3 Avoid
 Example 7 Avoid
Example 5 Preferred
Example 6 Avoid
 Summary of Store-to-Load Forwarding Pitfalls to Avoid
Stack Alignment Considerations
Align TBYTE Variables on Quadword Aligned Addresses
 Sort Variables According to Base Type Size
 Avoid Branches Dependent on Random Data
Branch Optimizations
Blended AMD-K6 and AMD Athlon Processor Code
AMD Athlon Processor Specific Code
 Example 7 Integer Signum Function
Always Pair Call and Return
Example 6 Increment Ring Buffer Offset
 Muxing Constructs
Replace Branches with Computation in 3DNow! Code
 3DNow! code
Sample Code Translated into 3DNow! Code
Code
 MM5
PFSUB
PSRAD
 Avoid Far Control Transfer Instructions
Avoid the LOOP Instruction
 Avoid Recursive Functions
 Complete Loop Unrolling
Scheduling Optimizations
Schedule Instructions According to their Latency
Unrolling Loops
 Partial Loop Unrolling
 With Partial Loop Unrolling
Without Loop Unrolling
 Example 1 rolled loop
Deriving Loop Control For Partially Unrolled Loops
 Overview
Use Function Inlining
 Always Inline Functions if Called from One Site
Avoid Address Generation Interlocks
 Minimize Pointer Arithmetic in Loops
Use MOVZX and MOVSX
MOV ECX, MAXSIZE
 Example 3 Preferred
Push Memory Data Carefully
 Multiplication by Reciprocal Division Utility
Integer Optimizations
Replace Divides with Multiplies
 Unsigned Division by Multiplication of Constant
Simpler Code for Restricted Dividend
Signed Division by Multiplication of Constant
Signed Division By 2
Signed Division By 2^n
Remainder of Signed Integer 2 or -2
Remainder of Signed Integer 2^n or -2^n
Use Alternative Code When Multiplying by a Constant
 ADD REG1, REG1 REG1, REG2 SHL
 Use MMX Instructions for Integer-Only Work
 Guidelines for Repeated String Instructions
Repeated String Instruction Usage
Latency of Repeated String Instructions
 Destination with
Using MOVQ
Ensure DF=0 (UP)
Align Source
 Efficient 64-Bit Integer Arithmetic
Use XOR Instruction to Clear Integer Registers
 Example 6 Multiplication
Example 4 Left shift
Example 5 Right shift
 EBX, ESP+12 Dividendlo
Example 7 Division
 Example 8 Remainder
 SHR EDX
 Step
Efficient Implementation of Population Count Function
 Bit field. Thus the following computation is performed
 MOV EDX, EDX SHR
 Derivation of Multiplier Used for Integer Division by
 Utility sdiv.exe was compiled using the following code
MOV ECX EDX IMUL ADD SHR SAR
 Use Multiplies Rather than Divides
Floating-Point Optimizations
Ensure All FPU Data is Aligned
Use FFREEP Macro to Pop One Register from the FPU Stack
Floating-Point Compare Instructions
Use the FXCH Instruction Rather than FST/FLD Pairs
Avoid Using Extended-Precision Data
 Example 1 Fast
Minimize Floating-Point-to-Integer Conversions
 Example 2 Potentially faster
 Example 4 Fastest
Example 3 Potentially faster
 Floating-Point Subexpression Elimination
Take Advantage of the FSINCOS Instruction
Use FEMMS Instruction
3DNow! and MMX Optimizations
Use 3DNow! Instructions
 Optimized Full 24-Bit Precision Divide
Use 3DNow! Instructions for Fast Division
Optimized 14-Bit Precision Divide
 Newton-Raphson Reciprocal
Pipelined Pair of 24-Bit Precision Divides
 Optimized 24-Bit Precision Square Root
Optimized 15-Bit Precision Square Root
 Newton-Raphson Reciprocal Square Root
 Blended Code
3DNow! and MMX Intra-Operand Swapping
AMD Athlon Specific Code
Use MMX PXOR to Negate 3DNow! Data
Fast Conversion of Signed Words to Floating-Point
Use MMX PCMP Instead of 3DNow! PFCMP
Both Numbers Positive
One Negative, One Positive
AMD-K6 and AMD Athlon Processor Blended Code
Use MMX Instructions for Block Copies and Block Fills
 Processor Specific
Use MMX PXOR to Clear All Bits in an MMX Register
Use MMX PCMPEQD to Set All Bits in an MMX Register
Optimized Matrix Multiplication
 MOV EBX, RES
 Data
Use 3DNow! PAVGUSB for MPEG-2 Motion Compensation
 MM0=QWORD1
 Stream of Packed Unsigned Bytes
 Complex Number Arithmetic
 Short Forms
General x86 Optimization Guidelines
 Stack Allocation
Dependencies
Register Operands
 Introduction
AMD Athlon Processor Microarchitecture
 Superscalar Processor
 Branch Prediction
Predecode
 Early Decoding
 Data Cache
Instruction Control Unit
 Integer Execution Unit
Integer Scheduler
 Floating-Point Scheduler
Floating-Point Execution Unit
 Load/Store Unit
Load-Store Unit (LSU)
 AMD Athlon System Bus
L2 Cache Controller
Write Combining
 Fetch and Decode Pipeline Stages
Pipeline and Execution Unit Resources Overview
Cycle 1-FETCH
Cycle 2-SCAN
Cycle 3 DirectPath
Cycle 3 VectorPath
 Integer Pipeline Stages
Cycle 7-SCHED
Cycle 8-EXEC
Cycle 9-ADDGEN
Cycle 10-DCACC
 Floating-Point Pipeline Stages
Cycle 7-STKREN
Cycle 8-REGREN
Cycle 9-SCHEDW
Cycle 10-SCHED
 Results
Execution Unit Resources
Terminology
Operands
 Integer Decode Types
Integer Pipeline Operations
Integer Pipeline Operation Types
 Floating-Point Decode Types
Floating-Point Pipeline Operations
Floating-Point Pipeline Operation Types
 Stage 1 Cycle Stage 2 Cycle Stage 3 Cycle
Load/Store Pipeline Operations
Load/Store Unit Stages
 Code Sample Analysis
 INC EBX ADD ESI, EDX
IMUL EAX, ECX INC ESI MOV
ADD EDI, EBX SHL
 SAR
Sample 2 Integer Register and Memory Load Operations
DEC EDX MOV EDI, ECX SUB
Appendix C Implementation of Write Combining
 Programming Details
What is Write Combining?
Write-Combining Definitions and Abbreviations
 Write-Combining Operations
 INIT, Halt
Write Combining Completion Events
 Sending Write-Buffer Data to the System
AMD Athlon System Bus Commands Generation Rules
 Performance Counter Usage
Performance-Monitoring Counters
PerfEvtSel[3:0] Registers
PerfEvtSel[3:0] MSRs (MSR Addresses C0010000h-C0010003h)
 74h
65h
73h
Snoop hits
 Instruction cache misses
Event Source Event Description
Waited to use the L2
Instruction cache fetches
PerfCtr[3:0] MSRs (MSR Addresses C0010004h-C0010007h)
 Starting and Stopping the Performance-Monitoring Counters
Event and Time-Stamp Monitoring Software
 Monitoring Counter Overflow
Memory Type Range Register (MTRR) Mechanism
Programming the MTRR
 100000h Kbytes each Fixed Ranges C0000h 80000h
Fixed Ranges
FFFFFFFFh
 Register Format
Memory Types
Memory Type Encodings
MTRR Capability
Standard MTRR Types and Properties
Page Attribute Table (MSR 277h)
Page Attribute Table (PAT)
 PATi
PAT Entry Reset Value
PATi 3-Bit Encodings
MTRRs and PAT
PAT Memory Type MTRR Memory Type
Effective Memory Type Based on PAT and MTRRs
 Input Memory Type
Final Output Memory Types
 C7FFF C6FFF C5FFF C4FFF C3FFF C2FFF C1FFF C0FFF
7FFFF 6FFFF 5FFFF 4FFFF 3FFFF 2FFFF 1FFFF 0FFFF
9FFFF 9BFFF 97FFF 93FFF 8FFFF 8BFFF 87FFF 83FFF
BFFFF BBFFF B7FFF B3FFF AFFFF ABFFF A7FFF A3FFF
 MTRRphysMaskn Register Format
 MTRR-Related Model-Specific Register MSR Map
Appendix F Instruction Dispatch and Execution Resources
 AAM
Integer Instructions
AAA
AAD
 ADC mreg16/32, reg16/32
ModR/M Decode Byte
ADC mreg8, reg8
ADC mem8, reg8
BSWAP EDX
BOUND
BSWAP EAX
BSWAP ECX
 CLI
CBW/CWDE
CLC
CLD
 CMOVG/CMOVNLE reg16/32, mem16/32 0Fh
CMOVE/CMOVZ reg16/32, reg16/32 0Fh
CMOVE/CMOVZ reg16/32, mem16/32 0Fh
CMOVG/CMOVNLE reg16/32, reg16/32 0Fh
 DAS
CPUID
CWD/CDQ
DAA
 EAX, DX
ENTER
AL, DX
AX, DX
INVLPG
INVD
LEAVE
LAHF
 LSL reg16/32, mem16/32 0Fh 03h
LOOPE/LOOPZ disp8 E1h
LOOPNE/LOOPNZ disp8 E0h
LSL reg16/32, mreg16/32 0Fh 03h
NOP (XCHG EAX, EAX)
 POP ES
OUT DX, AL
OUT DX, AX
OUT DX, EAX
 POP ESI
POP EBX
POP ESP
POP EBP
RDTSC
RDMSR
RDPMC
SAHF
 SBB reg8, mem8
SBB mreg16/32, reg16/32
SBB mem16/32, reg16/32
SBB reg8, mreg8
SETNS mem8
SETS mreg8
SETS mem8
SETNS mreg8
 STI
STC
STD
SYSRET
SYSCALL
SYSENTER
SYSEXIT
XCHG EAX, EBX
XCHG EAX, EAX
XCHG EAX, ECX
XCHG EAX, EDX
EMMS
MMX Instructions
PCMPEQB mmreg1, mmreg2
PANDN mmreg1, mmreg2
DFh
PANDN mmreg, mem64
 FADD/FMUL
 FPU
MMX Extensions
 Floating-Point Instructions
FDECSTP
FCOMPP
FCOS
 FLD1
FINCSTP
FINIT
 FLDLN2
FLDL2E
FLDL2T
FLDLG2
FUCOMP
FSTSW AX
FTST
FUCOM
FEMMS
3DNow! Instructions
3DNow! Extensions
 DirectPath Instructions
DirectPath versus VectorPath Instructions
 CBW/CWDE CLC CMC
DirectPath Integer Instructions
BT mreg16/32, reg16/32 BT mreg16/32, imm8 BT mem16/32, imm8
 INC mreg8 INC mem8 INC mreg16/32 INC mem16/32 JO short disp8
DEC mreg8 DEC mem8 DEC mreg16/32 DEC mem16/32
WAIT XCHG EAX, EAX
 DirectPath MMX Instructions
 DirectPath MMX Extensions
FLD1 FLDL2E FLDL2T FLDLG2 FLDLN2 FLDPI FLDZ
DirectPath Floating-Point Instructions
FCOMPP FDECSTP
FIST mem16int
FTST FUCOM FUCOMP FUCOMPP FWAIT FXCH
CLD CLI CLTS
VectorPath Instructions
VectorPath Integer Instructions
AAA AAD AAM AAS
AL, DX AX, DX EAX, DX INVD INVLPG
Instruction Mnemonic DIV EAX, mem16/32
RDMSR RDPMC RDTSC
POP mreg16/32 POP mem16/32
PUSH mreg16/32 PUSH mem16/32
PUSHA/PUSHAD PUSHF/PUSHFD
WBINVD WRMSR
VectorPath MMX Instructions
VectorPath MMX Extensions
SYSCALL SYSENTER SYSEXIT SYSRET
FXAM FXTRACT FYL2X FYL2XP1
VectorPath Floating-Point Instructions
FPTAN FPATAN FRNDINT
FSCALE FSIN FSINCOS
 Index