319, Fadd/Fmul | AMD 250 manual

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Table 18. SSE Instructions (Continued)

Syntax	Prefix byte	First byte	2nd byte	ModRM byte	Decode type	FPU pipe(s)	Latency	Note

DIVPS xmmreg, mem128	0Fh	5Eh		mm-xxx-xxx	Double	FMUL	35

DIVSS xmmreg1, xmmreg2	F3h	0Fh	5Eh	11-xxx-xxx	DirectPath	FMUL	16

DIVSS xmmreg, mem32	F3h	0Fh	5Eh	mm-xxx-xxx	DirectPath	FMUL	18

LDMXCSR mem32	0Fh	AEh		mm-010-xxx	VectorPath		13	4

MASKMOVQ mmreg1, mmreg2	0Fh	F7h		11-xxx-xxx	VectorPath	FADD/FMUL/FSTORE	29

MAXPS xmmreg1, xmmreg2	0Fh	5Fh		11-xxx-xxx	Double	FADD	3	1

MAXPS xmmreg, mem128	0Fh	5Fh		mm-xxx-xxx	Double	FADD	5	1

MAXSS xmmreg1, xmmreg2	F3h	0Fh	5Fh	11-xxx-xxx	DirectPath	FADD	2

MAXSS xmmreg, mem32	F3h	0Fh	5Fh	mm-xxx-xxx	DirectPath	FADD	4

MINPS xmmreg1, xmmreg2	0Fh	5Dh		11-xxx-xxx	Double	FADD	3	1

MINPS xmmreg, mem128	0Fh	5Dh		mm-xxx-xxx	Double	FADD	5	1

MINSS xmmreg1, xmmreg2	F3h	0Fh	5Dh	11-xxx-xxx	DirectPath	FADD	2

MINSS xmmreg, mem32	F3h	0Fh	5Dh	mm-xxx-xxx	DirectPath	FADD	4

MOVAPS xmmreg1, xmmreg2	0Fh	28h		11-xxx-xxx	Double		2

MOVAPS xmmreg, mem128	0Fh	28h		mm-xxx-xxx	Double		2
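
As a rough illustration of how the encoding columns fit together, the C sketch below assembles the byte sequence for a register-to-register form (ModRM pattern 11-xxx-xxx). The helper names `modrm_reg_reg` and `encode_sse_rr` are hypothetical and do not come from the guide. The sketch assumes registers xmm0 through xmm7 only (no REX prefix), that the destination register goes in the middle three ModRM bits and the source in the low three, and that passing a prefix of 0 means the row has no F3h byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: build a register-register ModRM byte,
 * i.e. the "11-xxx-xxx" pattern from the table.
 * reg fills the middle three bits, rm the low three bits. */
static uint8_t modrm_reg_reg(unsigned reg, unsigned rm)
{
    assert(reg < 8 && rm < 8);  /* xmm0..xmm7 only; no REX handling */
    return (uint8_t)(0xC0u | (reg << 3) | rm);
}

/* Hypothetical helper: write the bytes of an SSE register-register
 * instruction into out (at least 4 bytes) and return the byte count.
 * prefix == 0 means "no F3h byte" (for example MAXPS, MINPS, MOVAPS).
 * opcode is the byte that follows 0Fh (for example 5Fh for MAXPS/MAXSS). */
static size_t encode_sse_rr(uint8_t *out, uint8_t prefix, uint8_t opcode,
                            unsigned dst, unsigned src)
{
    size_t n = 0;
    if (prefix != 0) {
        out[n++] = prefix;          /* F3h selects the scalar (xxSS) form */
    }
    out[n++] = 0x0F;                /* two-byte opcode escape */
    out[n++] = opcode;
    out[n++] = modrm_reg_reg(dst, src);
    return n;
}
```

Under these assumptions, `encode_sse_rr(buf, 0xF3, 0x5F, 1, 2)` produces F3h 0Fh 5Fh CAh for MAXSS xmm1, xmm2, and `encode_sse_rr(buf, 0, 0x5D, 3, 0)` produces 0Fh 5Dh D8h for MINPS xmm3, xmm0.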

Notes:

1. The low half of the result is available one cycle earlier than listed.

2. The second latency value indicates when the low half of the result becomes available.

3. The high half of the result is available one cycle earlier than listed.

4. The latency listed is the absolute minimum, while average latencies may be higher and are a function of internal pipeline conditions.

5. For the PREFETCHNTA/T0/T1/T2 instructions, the mem8 value refers to an address in the 64-byte line to be prefetched.

6. The 8-clock latency is only visible to younger stores that need to do an external write. The 2-clock latency is visible to the other stores and instructions.

7. This is the execution latency for the instruction. The time to complete the external write depends on the memory speed and the hardware implementation.
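
One way to read the Latency column is as a per-instruction cost in a chain where each instruction consumes the previous result. The C sketch below adds up such a chain and applies Note 1 to a packed result. The function names are hypothetical, and the simple sum is only an estimate that assumes a strictly serial dependency chain with no decode, scheduling, or memory stalls; measured timings may be higher.

```c
#include <assert.h>
#include <stddef.h>

/* Latencies in cycles, copied from the register-register rows of the table. */
enum {
    LAT_DIVSS_RR = 16,
    LAT_MAXSS_RR = 2,
    LAT_MINSS_RR = 2,
    LAT_MAXPS_RR = 3
};

/* Hypothetical helper: estimated latency of a serial dependency chain,
 * taken as the plain sum of the listed latencies. */
static unsigned chain_latency(const unsigned *lat, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; ++i) {
        total += lat[i];
    }
    return total;
}

/* Hypothetical helper for rows marked with Note 1: the low half of the
 * result is available one cycle earlier than the listed latency. */
static unsigned low_half_latency(unsigned listed)
{
    assert(listed > 0);
    return listed - 1;
}
```

For example, a DIVSS followed by a dependent MAXSS and MINSS (all register forms) gives 16 + 2 + 2 = 20 cycles under these assumptions, and the low half of a register-form MAXPS result would be available after 2 cycles instead of 3.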

Appendix C

Instruction Latencies

