Software Optimization Guide For AMD64 Processors
© 2001–2005 Advanced Micro Devices, Inc. All rights reserved.
 Contents
 General 64-Bit Optimizations
Chapter Cache and Memory Optimizations
Chapter Integer Optimizations
Chapter x87 Floating-Point Optimizations
Appendix B Implementation of Write-Combining
Index
Tables
Figures
Revision History
 Getting Started Quickly
Intended Audience
 Using This Guide
 Providing Feedback
Special Information
Numbering Systems
Typographic Notation
 Internal Instruction Formats
Important New Terms
Primitive Operations
MROM
Types of Instructions
Instructions, Macro-ops and Micro-ops
 Optimizations by Rank
Key Optimizations
Guideline
Key Optimizations by Rank
C++ Source-Level Optimizations
 Rationale
Declarations of Floating-Point Values
Optimization
Application
 Using Arrays and Pointers
 Matrix
Example
 Instead, use the equivalent array notation
Additional Considerations
 Related Information
Unrolling Small Loops
 Expression Order in Compound Branch Conditions
 Long Logical Expressions in If Statements
 Arrange Boolean Operands for Quick Expression Evaluation
 Dynamic Memory Allocation Consideration
 Listing 3. Avoid
Unnecessary Store-to-Load Dependencies
 Application
 Examples
Matching Store and Load Size
 Listing 6. Preferred
 Switch and Noncontiguous Case Expressions
 Example
 Related Information
 Arranging Cases by Probability of Occurrence
 Use of Function Prototypes
 Use of const Type Qualifier
 Rationale and Examples
Generic Loop Hoisting
 Listing
 Local Static Functions
 Explicit Parallelism in Code
 Listing 11. Preferred
 Extracting Common Subexpressions
 Listing 15. Example 2 Preferred
Sorting and Padding C and C++ Structures
 Sorting Local Variables
 Related Information
 Replacing Integer Division with Multiplication
 Listing 16. Avoid
Frequently Dereferenced Pointer Arguments
 Listing 17. Preferred
 Array Indices
Rationale
32-Bit Integral Data Types
 Sign of Integer Operands
 Listing 20. Example 2 Avoid
 Accelerating Floating-Point Division and Square Root
 Examples
 Fast Floating-Point-to-Integer Conversion
 Listing 23. Slow
 Branches Dependent on Integer Comparisons Are Fast
Speeding Up Branches Based on Comparisons Between Floats
 Comparisons against Positive Constant
 Comparisons among Two Floats
 Improving Performance in Linux Libraries
 General 64-Bit Optimizations
 64-Bit Registers and Integer Arithmetic
This code performs 64-bit addition using 32-bit registers
[ESP+8]:[ESP+4] = multiplicand
 Background
64-Bit Arithmetic and Large-Integer Multiplication
g1 = c3; e1 + f1 + g0 = c2; d1 + e0 + f0 = c1; d0 = c0
 128-Bit Media Instructions and Floating-Point Operations
 32-Bit Legacy GPRs and Small Unsigned Integers
 Instruction-Decoding Optimizations
 DirectPath Instructions
 Load-Execute Integer Instructions Optimization
Load-Execute Instructions
movss xmm0, [floatvar1]
mulss xmm0, [floatvar2]
 Application
 Branch Targets in Program Hot Spots
32/64-Bit vs. 16-Bit Forms of the LEA Instruction
 Take Advantage of x86 and AMD64 Complex Addressing Modes
cmpb %al, 0x68e35(%r10,%r13)
 Short Instruction Encodings
 Partial-Register Reads and Writes
 Avoid
Functions That Do Not Allocate Local Variables
Using LEAVE for Function Epilogues
Functions That Allocate Local Variables
A traditional function epilogue looks like this:
Alternatives to the SHLD Instruction
lea reg1, [reg1*8+reg2]
8-Bit Sign-Extended Immediate Values
8-Bit Sign-Extended Displacements
Code Padding with Operand-Size Override and NOP
 Cache and Memory Optimizations
Avoid Memory-Size Mismatches
Examples: Store-to-Load-Forwarding Stalls
Avoid
 Preferred If Stores Are Close to the Load
Preferred If the Contents of MM0 are No Longer Needed
Examples: Large-to-Small Mismatches
 Preferred If the Stores and Loads are Close Together, Option
 Natural Alignment of Data Objects
Cache-Coherent Nonuniform Memory Access (ccNUMA)
 CPU0
 OS Implications
Dual-Core AMD Opteron Processor Configuration
 Multiprocessor Considerations
Store-to-Load Forwarding Restrictions
Store-to-Load Forwarding Pitfalls: True Dependencies
Narrow-to-Wide Store-Buffer Data-Forwarding Restriction
Wide-to-Narrow Store-Buffer Data-Forwarding Restriction
Misaligned Store-Buffer Data-Forwarding Restriction
High-Byte Store-Buffer Data-Forwarding Restriction
One Supported Store-to-Load Forwarding Case
Store-to-Load Forwarding: False Dependencies
Summary of Store-to-Load-Forwarding Pitfalls to Avoid
Prefetch Instructions
Prefetching versus Preloading
Unit-Stride Access
Hardware Prefetching
PREFETCH/W versus PREFETCHNTA/T0/T1/T2
PREFETCHW versus PREFETCH
Write-Combining Usage
Multiple Prefetches
Determining Prefetch Distance
 Processor-Limited Code
Memory-Limited Code
Definitions
Prefetch at Least 64 Bytes Away from Surrounding Stores
 Streaming-Store/Non-Temporal Instructions
Write-Combining
 Fields Used to Address the Multibank L1 Data Cache
How to Know If a Bank Conflict Exists
L1 Data Cache Bank Conflicts
 Placing Code and Data in the Same 64-Byte Cache Line
Sorting and Padding C and C++ Structures
Memory Copy
Copying Small Data Structures
Stack Considerations
Extend Arguments to 32 Bits Before Pushing onto Stack
Optimized Stack Usage
Cache Issues when Writing Instruction Bytes to Memory
Interleave Loads and Stores
Density of Branches
Two-Byte Near-Return RET Instruction
Branches That Depend on Random Data
Signed Integer ABS Function (x = labs(x))
Unsigned Integer min Function (z = x < y ? x : y)
Conditional Write
Pairing Call and Return
Recursive Functions
Nonzero Code-Segment Base Values
Replacing Branches with Computation
Muxing Constructs
SSE Solution (Preferred)
MMX Solution (Avoid)
Example 2: C Code
Sample Code Translated into AMD64 Code
Example 1: C Code
Example 1: 3DNow! Code
Example 4: 3DNow! Code
Example 3: C Code
Example 3: 3DNow! Code
Example 4: C Code
Example 5: C Code
Example 5: 3DNow! Code
Loop Instruction
Far Control-Transfer Instructions
Chapter Scheduling Optimizations
Instruction Scheduling by Latency
Loop Unrolling
 Example Partial Loop Unrolling
Complete Loop Unrolling
Example Complete Loop Unrolling
Partial Loop Unrolling
Deriving the Loop Control for Partially Unrolled Loops
Inline Functions
Additional Recommendations
Address-Generation Interlocks
MOVZX and MOVSX
Pointer Arithmetic in Loops
Pushing Memory Data Directly onto the Stack
Chapter Integer Optimizations
 Unsigned Division Utility
Replacing Division with Multiplication
Multiplication by Reciprocal Division Utility
Signed Division Utility
Unsigned Division by Multiplication of Constant
Algorithm: Divisors 1 <= d < 2^31, Odd d
Algorithm: Divisors 2^31 <= d < 2^32
 Signed Division by
Signed Division by Multiplication of Constant
Simpler Code for Restricted Dividend
Algorithm: Divisors 2 <= d < 2^31
Remainder of Signed Division by 2^n or -2^n
Signed Division by 2^n
Signed Division by -2^n
Remainder of Signed Division by 2 or -2
Alternative Code for Multiplying by a Constant
 165
 166
 167
Repeated String Instructions
Latency of Repeated String Instructions
Guidelines for Repeated String Instructions
Use the Largest Possible Operand Size
Using XOR to Clear Integer Registers
Acceptable
64-Bit Negation
Efficient 64-Bit Integer Arithmetic in 32-Bit Mode
64-Bit Addition
64-Bit Subtraction
64-Bit Right Shift
64-Bit Multiplication
64-Bit Unsigned Division
64-Bit Signed Division
64-Bit Unsigned Remainder Computation
64-Bit Signed Remainder Computation
Integer Version
Efficient Binary-to-ASCII Decimal Conversion
Binary-to-ASCII Decimal Conversion Retaining Leading Zeros
Binary-to-ASCII Decimal Conversion Suppressing Leading Zeros
Unsigned Integer Division
Signed Integer Division
Example Code
Optimizing Integer Division
Optimizing with SIMD Instructions
Ensure All Packed Floating-Point Data Are Aligned
Rationale: Single Precision
Rationale: Double Precision
Use MOVLPx/MOVHPx Instructions for Unaligned Data Access
Use MOVAPD and MOVAPS Instead of MOVUPD and MOVUPS
Double-Precision 32 × 32 Matrix Multiplication
Passing Data between MMX and 3DNow! Instructions
Storing Floating-Point Data in MMX Registers
EMMS and FEMMS Usage
Single Precision
Double Precision
Clearing MMX and XMM Registers with XOR Instructions
 Code below Puts the Floating Point Sign Mask
PFPNACC
Listing 27. 4 × 4 Matrix Multiplication (SSE)
Listing 28. 4 × 4 Matrix Multiplication (3DNow! Technology)
x87 Floating-Point Optimizations
Using Multiplication Rather Than Division
Achieving Two Floating-Point Operations per Clock Cycle
Align and Pack DirectPath x87 Instructions
Floating-Point Compare Instructions
Using the FXCH Instruction Rather Than FST/FLD Pairs
Floating-Point Subexpression Elimination
Accumulating Precision-Sensitive Quantities in x87 Registers
Avoiding Extended-Precision Data
 Key Microarchitecture Features
Processor Block Diagram
Superscalar Processor
 L1 Instruction Cache
AMD Athlon 64 and AMD Opteron Processors Block Diagram
L1 Instruction Cache Specifications by Processor
Branch-Prediction Table
 Translation-Lookaside Buffer
L1 Instruction TLB Specifications
Fetch-Decode Unit
Instruction Control Unit
L1 Data Cache
L1 Data TLB Specifications
L2 TLB Specifications
 Integer Execution Unit
L1 Data Cache Specifications by Processor
Integer Scheduler
Floating-Point Scheduler
 Floating-Point Unit
Floating-Point Execution Unit
Load-Store Unit
L2 Cache
 HyperTransport Technology Interface
Buses for AMD Athlon 64 and AMD Opteron Processor
Integrated Memory Controller
HyperTransport Technology
Write-Combining Definitions and Abbreviations
Programming Details
Write-Combining Operations
Write-Combining Completion Events
Sending Write-Buffer Data to the System
Optimizations
Appendix C Instruction Latencies
Understanding Instruction Entries
Example Instruction Entry
Parts of the Instruction Entry
Interpreting Placeholders
Interpreting Latencies
 AAA
Integer Instructions
 ADD reg16/32/64, mem16/32/64
BSWAP EAX/RAX/R8
CMOVNP/CMOVPO reg16/32/64, mem16/32/64
CMOVNP/CMOVPO reg16/32/64, reg16/32/64
 CMP reg16/32/64, mreg16/32/64
 JA/JNBE disp16/32
LAHF
MFENCE
NOP (XCHG EAX, EAX)
RDTSC
RDMSR
RDPMC
SAHF
 SBB reg16/32/64, mem16/32/64
 STI
STC
STD
SYSEXIT
SYSCALL
SYSENTER
XCHG AX/EAX/RAX, DI/EDI/RDI/R15
XCHG AX/EAX/RAX, SI/ESI/RSI/R14
MMX Technology Instructions
FMUL
x87 Floating-Point Instructions
FDECSTP
FADD FCOMPP
FADD FCOS
FADD/FMUL FSTORE FINIT
FINCSTP
FYL2X
FXCH
FXTRACT
FEMMS
3DNow! Technology Instructions
3DNow! Technology Extensions
SSE Instructions
PREFETCHNTA mem8 0Fh 18h mm-000-xxx DirectPath
SFENCE
SSE2 Instructions
FMUL FSTORE
MASKMOVDQU
FADD FMUL FSTORE
FADD FMUL
FMUL PUNPCKHDQ
FMUL PUNPCKHBW
SSE3 Instructions
Fast-Write Optimizations
Fast-Write Optimizations for Graphics-Engine Programming
Cacheable-Memory Command Structure
Fast-Write Optimizations for Video-Memory Copies
Memory Optimizations
Northbridge Command Flow
Optimizations for Texture-Map Copies to AGP Memory
Optimizations for Vertex-Geometry Copies to AGP Memory
Types of XMM-Register Data
Types of SSE and SSE2 Instructions
Half-Register Operations
Zeroing Out an XMM Register
Clearing XMM Registers
Reuse of Dead Registers
Moving Data Between XMM Registers and GPRs
Saving and Restoring Registers of Unknown Format
SSE and SSE2 Copy Loops
Explicit Load Instructions
Data Conversion
Converting Scalar Values
Converting Vector Values
Converting Directly from Memory
Numerics
Index