22007E/0 — November 1999

AMD Athlon™ Processor x86 Code Optimization

Cycle 1–FETCH	The FETCH pipeline stage calculates the address of the next
	x86 instruction window to fetch from the processor caches or
	system memory.
Cycle 2–SCAN	SCAN determines the start and end pointers of instructions.
	SCAN can send up to six aligned instructions (DirectPath and
	VectorPath) to ALIGN1 and only one VectorPath instruction to
	the microcode engine (MENG) per cycle.
Cycle 3 (DirectPath)–ALIGN1	Because each 8-byte buffer (quadword queue) can contain up to
	three instructions, ALIGN1 can buffer up to a maximum of nine
	instructions, or 24 instruction bytes. ALIGN1 tries to send three
	instructions from an 8-byte buffer to ALIGN2 per cycle.
Cycle 3 (VectorPath)–MECTL	For VectorPath instructions, the microcode engine control
	(MECTL) stage of the pipeline generates the microcode entry
	points.
Cycle 4 (DirectPath)–ALIGN2	ALIGN2 prioritizes prefix bytes, determines the opcode,
	ModR/M, and SIB bytes for each instruction, and sends the
	accumulated prefix information to EDEC.
Cycle 4 (VectorPath)–MEROM	In the microcode engine ROM (MEROM) pipeline stage, the
	entry point generated in the previous cycle, MECTL, is used to
	index into the MROM to obtain the microcode lines necessary
	to decode the instruction sent by SCAN.
Cycle 5 (DirectPath)–EDEC	The early decode (EDEC) stage decodes information from the
	DirectPath stage (ALIGN2) and VectorPath stage (MEROM)
	into MacroOPs. In addition, EDEC determines register
	pointers, flag updates, immediate values, displacements, and
	other information. EDEC then selects either MacroOPs from
	the DirectPath or MacroOPs from the VectorPath to send to the
	instruction decoder (IDEC) stage.
Cycle 5 (VectorPath)–MEDEC/MESEQ	The microcode engine decode (MEDEC) stage converts x86
	instructions into MacroOPs. The microcode engine sequencer
	(MESEQ) performs the sequence controls (redirects and
	exceptions) for the MENG.
Cycle 6–IDEC/Rename	At the instruction decoder (IDEC)/rename stage, integer and
	floating-point MacroOPs diverge in the pipeline. Integer
	MacroOPs are scheduled for execution in the next cycle.
	Floating-point MacroOPs have their floating-point stack

Fetch and Decode Pipeline Stages
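
The stage descriptions above can be condensed into a small lookup table, which makes it easier to see where the DirectPath and VectorPath flows differ and how the ALIGN1 capacity figures relate to each other. The following Python sketch is only an illustration: the stage names and numbers are transcribed from the text, while the dictionary layout, the constant names, and the helper functions `stages` and `divergent_cycles` are assumptions made for this example. The count of three quadword buffers is inferred from the stated 24-byte and 8-byte figures and is not stated directly in the text.

```python
# Stage name per cycle, transcribed from the stage descriptions above.
# The data layout and helper names are illustrative assumptions.
DIRECT_PATH = {
    1: "FETCH",
    2: "SCAN",
    3: "ALIGN1",
    4: "ALIGN2",
    5: "EDEC",         # EDEC also selects between DirectPath and VectorPath MacroOPs
    6: "IDEC/Rename",
}

VECTOR_PATH = {
    1: "FETCH",
    2: "SCAN",
    3: "MECTL",
    4: "MEROM",
    5: "MEDEC/MESEQ",
    6: "IDEC/Rename",
}


def stages(path):
    """Return the stage names for 'DirectPath' or 'VectorPath' in cycle order."""
    table = {"DirectPath": DIRECT_PATH, "VectorPath": VECTOR_PATH}[path]
    return [table[cycle] for cycle in sorted(table)]


def divergent_cycles():
    """Return the cycles in which the two paths use different stages."""
    return [c for c in sorted(DIRECT_PATH) if DIRECT_PATH[c] != VECTOR_PATH[c]]


# ALIGN1 capacity figures from the text.
QUADWORD_BYTES = 8            # size of one quadword queue buffer
MAX_INSTR_PER_QUADWORD = 3    # at most three instructions per 8-byte buffer
ALIGN1_MAX_BYTES = 24         # total instruction bytes ALIGN1 can buffer

# Inferred, not stated in the text: 24 bytes / 8 bytes per buffer = 3 buffers.
ALIGN1_QUADWORDS = ALIGN1_MAX_BYTES // QUADWORD_BYTES
ALIGN1_MAX_INSTR = ALIGN1_QUADWORDS * MAX_INSTR_PER_QUADWORD

if __name__ == "__main__":
    print("DirectPath:", " -> ".join(stages("DirectPath")))
    print("VectorPath:", " -> ".join(stages("VectorPath")))
    print("Cycles where the paths differ:", divergent_cycles())
    print("ALIGN1 maximum instructions:", ALIGN1_MAX_INSTR)
```

Under these assumptions the two flows share cycles 1 and 2, differ in cycles 3 through 5, and rejoin at IDEC/Rename in cycle 6. The product of three buffers and three instructions per buffer matches the nine-instruction maximum quoted for ALIGN1.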


