Optimized Matrix Multiplication 121 | AMD x86 specification

22007E/0 — November 1999

AMD Athlon™ Processor x86 Code Optimization

$$xform:
ADD	EBX, 16	;res++
MOVQ	MM0, QWORD PTR [EDX]	;v->y | v->x
MOVQ	MM1, QWORD PTR [EDX+8]	;v->w | v->z
ADD	EDX, 16	;v++
MOVQ	MM2, MM0	;v->y | v->x
MOVQ	MM3, QWORD PTR [EAX+M00]	;m[0][1] | m[0][0]
PUNPCKLDQ	MM0, MM0	;v->x | v->x
MOVQ	MM4, QWORD PTR [EAX+M10]	;m[1][1] | m[1][0]
PFMUL	MM3, MM0	;v->x*m[0][1] | v->x*m[0][0]
PUNPCKHDQ	MM2, MM2	;v->y | v->y
PFMUL	MM4, MM2	;v->y*m[1][1] | v->y*m[1][0]
MOVQ	MM5, QWORD PTR [EAX+M02]	;m[0][3] | m[0][2]
MOVQ	MM7, QWORD PTR [EAX+M12]	;m[1][3] | m[1][2]
MOVQ	MM6, MM1	;v->w | v->z
PFMUL	MM5, MM0	;v->x*m[0][3] | v->x*m[0][2]
MOVQ	MM0, QWORD PTR [EAX+M20]	;m[2][1] | m[2][0]
PUNPCKLDQ	MM1, MM1	;v->z | v->z
PFMUL	MM7, MM2	;v->y*m[1][3] | v->y*m[1][2]
MOVQ	MM2, QWORD PTR [EAX+M22]	;m[2][3] | m[2][2]
PFMUL	MM0, MM1	;v->z*m[2][1] | v->z*m[2][0]
PFADD	MM3, MM4	;v->x*m[0][1]+v->y*m[1][1] |
		; v->x*m[0][0]+v->y*m[1][0]
MOVQ	MM4, QWORD PTR [EAX+M30]	;m[3][1] | m[3][0]
PFMUL	MM2, MM1	;v->z*m[2][3] | v->z*m[2][2]
PFADD	MM5, MM7	;v->x*m[0][3]+v->y*m[1][3] |
		; v->x*m[0][2]+v->y*m[1][2]
MOVQ	MM1, QWORD PTR [EAX+M32]	;m[3][3] | m[3][2]
PUNPCKHDQ	MM6, MM6	;v->w | v->w
PFADD	MM3, MM0	;v->x*m[0][1]+v->y*m[1][1]+v->z*m[2][1] |
		; v->x*m[0][0]+v->y*m[1][0]+v->z*m[2][0]
PFMUL	MM4, MM6	;v->w*m[3][1] | v->w*m[3][0]
PFMUL	MM1, MM6	;v->w*m[3][3] | v->w*m[3][2]
PFADD	MM5, MM2	;v->x*m[0][3]+v->y*m[1][3]+v->z*m[2][3] |
		; v->x*m[0][2]+v->y*m[1][2]+v->z*m[2][2]
PFADD	MM3, MM4	;v->x*m[0][1]+v->y*m[1][1]+v->z*m[2][1]+
		; v->w*m[3][1] | v->x*m[0][0]+v->y*m[1][0]+
		; v->z*m[2][0]+v->w*m[3][0]
MOVQ	[EBX-16], MM3	;store res->y | res->x
PFADD	MM5, MM1	;v->x*m[0][3]+v->y*m[1][3]+v->z*m[2][3]+
		; v->w*m[3][3] | v->x*m[0][2]+v->y*m[1][2]+
		; v->z*m[2][2]+v->w*m[3][2]
MOVQ	[EBX-8], MM5	;store res->w | res->z
DEC	ECX	;numverts--
JNZ	$$xform	;until numverts == 0
FEMMS		;clear MMX state

}

}
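
The comments in the listing spell out the arithmetic the loop performs: each result component is a sum of four products of a vertex component and a matrix element. As a cross-check for the 3DNow! code, that arithmetic can be sketched in plain C. The type names Vec4 and Mat4 and the function names xform_ref and xform_paired are hypothetical and do not come from this guide. The x, y, z, w field order and the row-major 4x4 layout are assumptions inferred from the load offsets and comments in the listing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types, not taken from the guide. Field order x, y, z, w is
   assumed from the loads at [EDX] (x, y) and [EDX+8] (z, w). */
typedef struct { float x, y, z, w; } Vec4;
typedef struct { float m[4][4]; } Mat4;

/* Scalar reference: res[i] = v[i] * m, with each vertex as a row vector. */
static void xform_ref(Vec4 *res, const Vec4 *v, const Mat4 *mat, size_t numverts)
{
    assert(res != NULL && v != NULL && mat != NULL);
    for (size_t i = 0; i < numverts; ++i) {
        const Vec4 in = v[i];  /* local copy, so res may alias v */
        res[i].x = in.x * mat->m[0][0] + in.y * mat->m[1][0]
                 + in.z * mat->m[2][0] + in.w * mat->m[3][0];
        res[i].y = in.x * mat->m[0][1] + in.y * mat->m[1][1]
                 + in.z * mat->m[2][1] + in.w * mat->m[3][1];
        res[i].z = in.x * mat->m[0][2] + in.y * mat->m[1][2]
                 + in.z * mat->m[2][2] + in.w * mat->m[3][2];
        res[i].w = in.x * mat->m[0][3] + in.y * mat->m[1][3]
                 + in.z * mat->m[2][3] + in.w * mat->m[3][3];
    }
}

/* Two-lane view of the same computation, following the register usage in the
   listing: one accumulator pair holds (x, y), the other holds (z, w). */
static void xform_paired(Vec4 *res, const Vec4 *v, const Mat4 *mat, size_t numverts)
{
    assert(res != NULL && v != NULL && mat != NULL);
    for (size_t i = 0; i < numverts; ++i) {
        const float in[4] = { v[i].x, v[i].y, v[i].z, v[i].w };
        float lo[2] = { 0.0f, 0.0f };  /* becomes res->x, res->y (MM3 in the listing) */
        float hi[2] = { 0.0f, 0.0f };  /* becomes res->z, res->w (MM5 in the listing) */
        for (int r = 0; r < 4; ++r) {
            /* One component of v is broadcast to both lanes (the PUNPCKLDQ and
               PUNPCKHDQ steps), then multiplied by one half of matrix row r. */
            lo[0] += in[r] * mat->m[r][0];
            lo[1] += in[r] * mat->m[r][1];
            hi[0] += in[r] * mat->m[r][2];
            hi[1] += in[r] * mat->m[r][3];
        }
        res[i].x = lo[0]; res[i].y = lo[1];
        res[i].z = hi[0]; res[i].w = hi[1];
    }
}
```

Either function can serve as a baseline when validating the assembly version on sample data. The paired form is closer to how the listing is scheduled, since every PFMUL and PFADD there operates on two single-precision values at once.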

