Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

Table 15. x87 Floating-Point Instructions (Continued)

	Syntax	Encoding: first byte	Encoding: second byte	ModRM byte	Decode type	FPU pipe(s)	Latency	Note
	FISTP [mem64int]	DFh		mm-111-xxx	DirectPath	FSTORE	4

	FISTTP [mem16int]	DFh		mm-001-xxx	DirectPath	FSTORE	4

	FISTTP [mem32int]	DBh		mm-001-xxx	DirectPath	FSTORE	4

	FISTTP [mem64int]	DDh		mm-001-xxx	DirectPath	FSTORE	4

	FISUB [mem32int]	DAh		mm-100-xxx	Double	-	11

	FISUB [mem16int]	DEh		mm-100-xxx	Double	-	11

	FISUBR [mem32int]	DAh		mm-101-xxx	Double	-	11

	FISUBR [mem16int]	DEh		mm-101-xxx	Double	-	11

	FLD ST(i)	D9h		11-000-xxx	DirectPath	FADD/FMUL	2	1

	FLD [mem32real]	D9h		mm-000-xxx	DirectPath	FADD/FMUL/FSTORE	4

	FLD [mem64real]	DDh		mm-000-xxx	DirectPath	FADD/FMUL/FSTORE	4

	FLD [mem80real]	DBh		mm-101-xxx	VectorPath	-	13

	FLD1	D9h		11-101-000	DirectPath	FSTORE	4

	FLDCW [mem16]	D9h		mm-101-xxx	VectorPath	-	11

	FLDENV [mem14byte]	D9h		mm-100-xxx	VectorPath	-	129

	FLDENV [mem28byte]	D9h		mm-100-xxx	VectorPath	-	129

	FLDL2E	D9h		11-101-010	DirectPath	FSTORE	4

	FLDL2T	D9h		11-101-001	DirectPath	FSTORE	4

	FLDLG2	D9h		11-101-100	DirectPath	FSTORE	4

	FLDLN2	D9h		11-101-101	DirectPath	FSTORE	4

	FLDPI	D9h		11-101-011	DirectPath	FSTORE	4

	FLDZ	D9h		11-101-110	DirectPath	FSTORE	4

	FMUL ST, ST(i)	D8h		11-001-xxx	DirectPath	FMUL	4	1

	FMUL ST(i), ST	DCh		11-001-xxx	DirectPath	FMUL	4	1

Notes:

1. The last three bits of the ModRM byte select the stack entry ST(i).

2. These instructions have an effective latency as shown. However, these instructions generate an internal NOP with a latency of two cycles but no related dependencies. These internal NOPs can be executed at a rate of three per cycle and can use any of the three execution resources.

3. This is a VectorPath decoded operation that uses one execution pipe (one ROP).

4. There is additional latency associated with this instruction. “e” represents the difference between the exponents of the divisor and the dividend. If “s” is the number of normalization shifts performed on the result, then n = (s+1)/2, where 0 <= n <= 32.

5. The latency provided for this operation is the best-case latency.

6. The three latency numbers represent the latency values for precision control settings of single precision, double precision, and extended precision, respectively.
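
The ModRM column and Note 1 can be illustrated with a short sketch. It is not part of the guide: the helper names split_modrm and modrm_for_stack_entry are hypothetical, and the pattern strings are copied from the table on the assumption that the three dash-separated groups are the mod, reg, and r/m fields, most significant bits first.

```python
def split_modrm(modrm):
    """Split a ModRM byte into its (mod, reg, rm) bit fields."""
    if not 0 <= modrm <= 0xFF:
        raise ValueError("a ModRM byte must be in the range 0..255")
    return (modrm >> 6) & 0b11, (modrm >> 3) & 0b111, modrm & 0b111


def modrm_for_stack_entry(pattern, i):
    """Fill the 'xxx' field of a register-form pattern such as '11-001-xxx'
    with the stack index i, as Note 1 describes. Hypothetical helper."""
    mod, reg, rm = pattern.split("-")
    if mod != "11" or rm != "xxx":
        raise ValueError("expected a register-form pattern such as '11-001-xxx'")
    if not 0 <= i <= 7:
        raise ValueError("ST(i) requires 0 <= i <= 7")
    return (0b11 << 6) | (int(reg, 2) << 3) | i


# FMUL ST, ST(3): first byte D8h, ModRM pattern 11-001-xxx with xxx = 011b.
encoding = bytes([0xD8, modrm_for_stack_entry("11-001-xxx", 3)])
print(encoding.hex())        # d8cb
print(split_modrm(0xCB))     # (3, 1, 3)
```

Rows whose pattern begins with "mm" take a memory operand, so their low three bits take part in addressing rather than selecting a stack entry; the sketch rejects those patterns instead of guessing at an addressing form.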

Instruction Latencies

Appendix C
