AMD x86 manual: Deriving Loop Control For Partially Unrolled Loops

AMD Athlon™ Processor x86 Code Optimization

22007E/0 — November 1999

	no faster than three iterations in 10 cycles, or 6/10
	floating-point adds per cycle, or 1.4 times as fast as the original
	loop.
Deriving Loop Control For Partially Unrolled Loops

	A frequently used loop construct is a counting loop. In a typical
	case, the loop count starts at some lower bound lo, increases by
	some fixed, positive increment inc for each iteration of the
	loop, and may not exceed some upper bound hi. The following
	example shows how to partially unroll such a loop by an
	unrolling factor of fac, and how to derive the loop control for
	the partially unrolled version of the loop.
	Example 1 (rolled loop):
	for (k = lo; k <= hi; k += inc) {
		x[k] = ...
	}
	Example 2 (partially unrolled loop):
	for (k = lo; k <= (hi - (fac-1)*inc); k += fac*inc) {
		x[k] = ...
		x[k+inc] = ...
		...
		x[k+(fac-1)*inc] = ...
	}
	/* handle end cases */
	for (k = k; k <= hi; k += inc) {
		x[k] = ...
	}
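
The loop control derived above can be checked with a small concrete instance. The following is a minimal sketch, not code from the guide: it assumes an unrolling factor of 4 (the constant FAC), a placeholder loop body x[k] = 2*k, and int array elements. The names fill_rolled, fill_unrolled, unrolled_matches_rolled, FAC, and N are hypothetical and introduced only for illustration.

```c
#include <assert.h>
#include <string.h>

enum { FAC = 4 };   /* assumed unrolling factor (fac in the text) */
enum { N = 64 };    /* assumed array size for the comparison helper */

/* Rolled loop (Example 1): visits k = lo, lo+inc, ... while k <= hi. */
static void fill_rolled(int *x, int lo, int hi, int inc)
{
    assert(inc > 0);
    for (int k = lo; k <= hi; k += inc) {
        x[k] = 2 * k;               /* placeholder loop body */
    }
}

/* Partially unrolled loop (Example 2) with fac = FAC = 4. */
static void fill_unrolled(int *x, int lo, int hi, int inc)
{
    int k;
    assert(inc > 0);
    /* Main loop: runs only while all FAC elements k .. k+(FAC-1)*inc
       are still within the upper bound hi. */
    for (k = lo; k <= hi - (FAC - 1) * inc; k += FAC * inc) {
        x[k]           = 2 * k;
        x[k + inc]     = 2 * (k + inc);
        x[k + 2 * inc] = 2 * (k + 2 * inc);
        x[k + 3 * inc] = 2 * (k + 3 * inc);
    }
    /* Handle end cases: at most FAC-1 leftover iterations. */
    for (; k <= hi; k += inc) {
        x[k] = 2 * k;
    }
}

/* Returns 1 if both versions write identical arrays for the given bounds. */
static int unrolled_matches_rolled(int lo, int hi, int inc)
{
    int a[N] = {0};
    int b[N] = {0};
    assert(lo >= 0 && hi < N);
    fill_rolled(a, lo, hi, inc);
    fill_unrolled(b, lo, hi, inc);
    return memcmp(a, b, sizeof a) == 0;
}
```

Calling unrolled_matches_rolled(3, 40, 3), for example, exercises both the main unrolled loop and the end-case loop, since the 13 iterations (k = 3, 6, ..., 39) are not a multiple of 4. The sketch also assumes the bounds are small enough that hi - (FAC-1)*inc does not overflow.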
