Software Optimization Guide For AMD64 Processors
 2001 2005 Advanced Micro Devices, Inc. All rights reserved
 Contents
 General 64-Bit Optimizations
 Chapter Cache and Memory Optimizations
 Chapter Integer Optimizations 159
 Chapter X87 Floating-Point Optimizations 237
 Viii
Appendix B Implementation of Write-Combining
 Index
 Software Optimization Guide for AMD64 Processors
 Tables
Tables
 Xii Tables
 Figures
 Xiv
 Revision History
 Xvi Revision History
 Getting Started Quickly
Intended Audience
 Using This Guide
 Numbering Systems
Special Information
Typographic Notation
Providing Feedback
 Internal Instruction Formats
Important New Terms
Primitive Operations
 Mrom
Types of Instructions
Instructions, Macro-ops and Micro-ops
 Guideline
Key Optimizations
Key Optimizations by Rank
Optimizations by Rank
 C++ Source-Level Optimizations
 C++ Source-Level Optimizations
 Optimization
Declarations of Floating-Point Values
Application
Rationale
 Using Arrays and Pointers
 Matrix
Example
 Instead, use the equivalent array notation
Additional Considerations
 Related Information
Unrolling Small Loops
 Expression Order in Compound Branch Conditions
 Chapter C++ Source-Level Optimizations
 Long Logical Expressions in If Statements
 Arrange Boolean Operands for Quick Expression Evaluation
 If *p == y && strlenp
 Dynamic Memory Allocation Consideration
 Listing 3. Avoid
Unnecessary Store-to-Load Dependencies
 Application
 Examples
Matching Store and Load Size
 Listing 6. Preferred
 C++ Source-Level Optimizations
 Switch and Noncontiguous Case Expressions
 Example
 Related Information
 Arranging Cases by Probability of Occurrence
 Use of Function Prototypes
 Use of const Type Qualifier
 Rationale and Examples
Generic Loop Hoisting
 Listing
 Chapter C++ Source-Level Optimizations
 Local Static Functions
 Explicit Parallelism in Code
 Listing 11. Preferred
 Extracting Common Subexpressions
 Listing 15. Example 2 Preferred
 Sorting and Padding C and C++ Structures
Sorting and Padding C and C++ Structures
 C++ Source-Level Optimizations
 Sorting Local Variables
 Related Information
 Replacing Integer Division with Multiplication
 Listing 16. Avoid
Frequently Dereferenced Pointer Arguments
 Listing 17. Preferred
 Array Indices
 Rational
23 32-Bit Integral Data Types
 Sign of Integer Operands
 Listing 20. Example 2 Avoid
 Accelerating Floating-Point Division and Square Root
 Examples
 Fast Floating-Point-to-Integer Conversion
 Listing 23. Slow
 Branches Dependent on Integer Comparisons Are Fast
Speeding Up Branches Based on Comparisons Between Floats
 Comparisons against Positive Constant
 Comparisons among Two Floats
 Improving Performance in Linux Libraries
 Software Optimization Guide for AMD64 Processors
 General 64-Bit Optimizations
 64-Bit Registers and Integer Arithmetic
This code performs 64-bit addition using 32-bit registers
 ESP+8ESP+4 = multiplicand
 Background
64-Bit Arithmetic and Large-Integer Multiplication
 G1 = c3 E1 + f1 + g0 = c2 D1 + e0 + f0 = c1 D0 = c0
 XMM
 END
 Text Segment
 128-Bit Media Instructions and Floating-Point Operations
 32-Bit Legacy GPRs and Small Unsigned Integers
 Chapter General 64-Bit Optimizations
 General 64-Bit Optimizations
 Instruction-Decoding Optimizations
 DirectPath Instructions
 Load-Execute Integer Instructions Optimization
Load-Execute Instructions
 Movss xmm0, floatvar1 mulss xmm0, floatvar2
 Application
 Branch Targets in Program Hot Spots
 32/64-Bit vs -Bit Forms of the LEA Instruction
 Take Advantage of x86 and AMD64 Complex Addressing Modes
 Cmpb %al,0x68e35%r10,%r13
 Short Instruction Encodings
 Partial-Register Reads and Writes
 Avoid
 Functions That Do not Allocate Local Variables
Using Leave for Function Epilogues
Functions That Allocate Local Variables
 Traditional function epilogue looks like this
 Alternatives to Shld Instruction
 Lea reg1, reg1*8+reg2
 10 8-Bit Sign-Extended Immediate Values
 11 8-Bit Sign-Extended Displacements
 NOP
Code Padding with Operand-Size Override
 Instruction-Decoding Optimizations
 Cache and Memory Optimizations
 Examples-Store-to-Load-Forwarding Stalls
Memory-Size Mismatches
Bit Avoid
Avoid
 Preferred If Stores Are Close to the Load
Preferred If the Contents of MM0 are No Longer Needed
Examples-Large-to-small Mismatches
 Preferred If the Stores and Loads are Close Together, Option
 Natural Alignment of Data Objects
 Cache-Coherent Nonuniform Memory Access ccNUMA
 CPU0
 OS Implications
Dual-Core AMD Opteron Processor Configuration
 Multiprocessor Considerations
 Store-to-Load Forwarding Pitfalls-True Dependencies
Store-to-Load Forwarding Restrictions
Narrow-to-Wide Store-Buffer Data-Forwarding Restriction
100
 101
Wide-to-Narrow Store-Buffer Data-Forwarding Restriction
Misaligned Store-Buffer Data-Forwarding Restriction
 One Supported Store-to-Load Forwarding Case
High-Byte Store-Buffer Data-Forwarding Restriction
Store-to-Load Forwarding-False Dependencies
102
 103
Summary of Store-to-Load-Forwarding Pitfalls to Avoid
 104
Prefetch Instructions
Prefetching versus Preloading
 Hardware Prefetching
Unit-Stride Access
PREFETCH/W versus PREFETCHNTA/T0/T1/T2
105
 106
Prefetchw versus Prefetch
Write-Combining Usage
 107
Multiple Prefetches
 108
Determining Prefetch Distance
 Processor-Limited Code
Memory-Limited Code
 Cache and Memory Optimizations
Definitions
 111
Prefetch at Least 64 Bytes Away from Surrounding Stores
 Streaming-Store/Non-Temporal Instructions
 113
Write-combining
 Fields Used to Address the Multibank L1 Data Cache
How to Know If a Bank Conflict Exists
L1 Data Cache Bank Conflicts
 115
 Placing Code and Data in the Same 64-Byte Cache Line
 117
Sorting and Padding C and C++ Structures
 118
 119
 120
Memory Copy
Copying Small Data Structures
 121
 Extend Arguments to 32 Bits Before Pushing onto Stack
Stack Considerations
Optimized Stack Usage
122
 123
Cache Issues when Writing Instruction Bytes to Memory
 124
Interleave Loads and Stores
 125
This Chapter
 126
Density of Branches
 Align
127
 128
Two-Byte Near-Return RET Instruction
 129
 Signed Integer ABS Function x = labsx
Branches That Depend on Random Data
Unsigned Integer min Function z = x y ? x y
130
 131
Conditional Write
 132
Pairing Call and Return
 133
Recursive Functions
 134
 135
Nonzero Code-Segment Base Values
 136
Replacing Branches with Computation
Muxing Constructs
 137
SSE Solution Preferred
MMX Solution Avoid
 Example 1 C Code
Sample Code Translated into AMD64 Code
Example 1 3DNow! Code
Example 2 C Code
 Example 3 3DNow! Code
Example 3 C Code
Example 4 C Code
Example 4 3DNow! Code
 140
Example 5 C Code
Example 5 3DNow! Code
 141
Loop Instruction
 142
Far Control-Transfer Instructions
 143
Chapter Scheduling Optimizations
 144
Instruction Scheduling by Latency
 145
Loop Unrolling
Loop Unrolling
 Example Complete Loop Unrolling
Complete Loop Unrolling
Partial Loop Unrolling
Example Partial Loop Unrolling
 147
Fadd
 148
Deriving the Loop Control for Partially Unrolled Loops
 149
Inline Functions
 150
Additional Recommendations
 151
Address-Generation Interlocks
Address-Generation Interlocks
 152
 153
Movzx and Movsx
 154
Pointer Arithmetic in Loops
 155
 156
 157
Pushing Memory Data Directly onto the Stack
 158
 159
Chapter Integer Optimizations
 Multiplication by Reciprocal Division Utility
Replacing Division with Multiplication
Signed Division Utility
Unsigned Division Utility
 Algorithm Divisors 1 = d 231, Odd d
Unsigned Division by Multiplication of Constant
Algorithm Divisors 231 = d
161
 Simpler Code for Restricted Dividend
Signed Division by Multiplication of Constant
Algorithm Divisors 2 = d
Signed Division by
 Signed Division by -2n
Signed Division by 2n
Remainder of Signed Division by 2 or
Remainder of Signed Division by 2n or -2n
 164
Alternative Code for Multiplying by a Constant
 165
 166
 Latency of Repeated String Instructions
Repeated String Instructions
Guidelines for Repeated String Instructions
167
 168
Use the Largest Possible Operand Size
 169
Using XOR to Clear Integer Registers
Acceptable
 Bit Addition
Efficient 64-Bit Integer Arithmetic in 32-Bit Mode
Bit Subtraction
Bit Negation
 Bit Multiplication
Bit Right Shift
Bit Unsigned Division
171
 172
 173
Bit Signed Division
 174
Bit Unsigned Remainder Computation
 175
 176
Bit Signed Remainder Computation
 177
 178
 179
 180
Integer Version
 181
Efficient Binary-to-ASCII Decimal Conversion
Binary-to-ASCII Decimal Conversion Retaining Leading Zeros
 182
 183
Binary-to-ASCII Decimal Conversion Suppressing Leading Zeros
 184
 185
 186
Unsigned Integer Division
 187
 188
 189
Signed Integer Division
Example Code
 190
 191
 192
Optimizing Integer Division
 193
Optimizing with Simd Instructions
 194
 195
Ensure All Packed Floating-Point Data are Aligned
 196
Rationale-Single Precision
 197
Rational-Double Precision
 198
Use MOVLPx/MOVHPx Instructions for Unaligned Data Access
 199
Use Movapd and Movaps Instead of Movupd and Movups
 Loop type Description
200
 201
 202
Double-Precision 32 ⋅ 32 Matrix Multiplication
 203
 204
 205
 206
 207
 208
Passing Data between MMX and 3DNow! Instructions
 209
Storing Floating-Point Data in MMX Registers
 210
Emms and Femms Usage
 XMM Text Segment
211
 212
 213
 214
 215
Single Precision
Double Precision
 216
Clearing MMX and XMM Registers with XOR Instructions
 217
 218
 219
 220
 221
 Code below Puts the Floating Point Sign Mask
222
 223
 224
 225
 226
 Pfpnacc
227
 228
 229
 230
 Listing 27 ⋅ 4 Matrix Multiplication SSE
231
 XMM3
232
 Listing 28 ⋅ 4 Matrix Multiplication 3DNow! Technology
233
 234
 235
 236
 237
X87 Floating-Point Optimizations
 238
Using Multiplication Rather Than Division
 239
Achieving Two Floating-Point Operations per Clock Cycle
 240
 241
 242
Align and Pack DirectPath x87 Instructions
 243
 244
Floating-Point Compare Instructions
 245
Using the Fxch Instruction Rather Than FST/FLD Pairs
 246
Floating-Point Subexpression Elimination
 247
Accumulating Precision-Sensitive Quantities in x87 Registers
 248
Avoiding Extended-Precision Data
 249
 Key Microarchitecture Features
 251
Processor Block Diagram
Superscalar Processor
 L1 Instruction Cache
AMD Athlon 64 and AMD Opteron Processors Block Diagram
 253
L1 Instruction Cache Specifications by Processor
Branch-Prediction Table
 Fetch-Decode Unit
L1 Instruction TLB Specifications
Instruction Control Unit
Translation-Lookaside Buffer
 10 L1 Data Cache
L1 Data TLB Specifications
L2 TLB Specifications
 Integer Execution Unit
L1 Data Cache Specifications by Processor
Integer Scheduler
 257
Floating-Point Scheduler
 Floating-Point Unit
Floating-Point Execution Unit
Load-Store Unit
 259
16 L2 Cache
 HyperTransport Technology Interface
Buses for AMD Athlon 64 and AMD Opteron Processor
Integrated Memory Controller
 261
HyperTransport Technology
 Software Optimization Guide for AMD64 Processors
 263
Write-Combining Definitions and Abbreviations
 264
Programming Details
Write-combining Operations
 265
Write-Combining Completion Events
 266
Sending Write-Buffer Data to the System
 267
Optimizations
 268
 269
Appendix C Instruction Latencies
 Example Instruction Entry
Understanding Instruction Entries
Parts of the Instruction Entry
270
 271
Interpreting Placeholders
 272
Interpreting Latencies
 Integer Instructions
Integer Instructions
273
AAA
 ADD reg16/32/64, mem16/32/64
274
 Bswap EAX/RAX/R8
275
 276
 277
 Mem16/32/64 CMOVNP/CMOVPO reg16/32/64, reg16/32/64
278
 CMP reg16/32/64, mreg16/32/64
279
 280
 281
 282
 283
 JA/JNBE disp16/32
284
 Lahf
285
 286
 Mfence
287
 288
 NOP Xchg EAX, EAX
289
 290
 291
 292
 Rdmsr
293
Rdpmc
Rdtsc
 Sahf
294
 SBB reg16/32/64, mem16/32/64
295
 296
 297
 298
 STC
299
STD
STI
 Syscall
300
Sysenter
Sysexit
 301
 Xchg AX/EAX/RAX, DI/EDI/RDI/R15
302
Xchg AX/EAX/RAX, SI/ESI/RSI/R14
 303
MMX Technology Instructions
MMX Technology Instructions
 Fmul
304
 305
 306
 307
X87 Floating-Point Instructions
X87 Floating-Point Instructions
 Fadd Fcompp
308
Fadd Fcos
Fdecstp
 FADD/FMUL Fstore Finit
309
Fincstp
 310
 311
 312
 Fxch
313
Fxtract
FYL2X
 3DNow! Technology Instructions
3DNow! Technology Instructions
314
Femms
 315
DNow! Technology Instructions
 316
3DNow! Technology Extensions
3DNow! Technology Extensions
 317
SSE Instructions
SSE Instructions
 318
 319
 320
 321
 Mem64 Prefetchnta mem8 0Fh 18h Mm-000-xxx DirectPath
322
 Sfence
323
 324
 325
 326
SSE2 Instructions
SSE2 Instructions
 Fmul Fstore
327
0FH
 328
 Maskmovdqu
329
 Fadd Fmul Fstore
330
 331
 332
 Fadd Fmul
333
 334
 335
 336
 337
 338
 Fmul Punpckhdq
339
Fmul Punpckhbw
 340
 341
 342
SSE3 Instructions
SSE3 Instructions
 343
 344
 345
Fast-Write Optimizations
 346
Fast-Write Optimizations for Graphics-Engine Programming
 347
Cacheable-Memory Command Structure
 348
 349
Fast-Write Optimizations for Video-Memory Copies
 350
 351
Memory Optimizations
 352
Northbridge Command Flow
 353
Optimizations for Texture-Map Copies to AGP Memory
Optimizations for Vertex-Geometry Copies to AGP Memory
 354
 355
Types of XMM-Register Data
Types of SSE and SSE2 Instructions
 356
Half-Register Operations
 357
Zeroing Out an XMM Register
Clearing XMM Registers
 358
 359
Reuse of Dead Registers
 360
Moving Data Between XMM Registers and GPRs
 361
Saving and Restoring Registers of Unknown Format
 362
SSE and SSE2 Copy Loops
 363
Explicit Load Instructions
 364
Data Conversion
Converting Scalar Values
 Converting Directly from Memory
Converting Vector Values
365
INT GPR FPS
 366
 367
Numerics
 368 Index