Software Optimization Guide For AMD64 Processors
 2001 2005 Advanced Micro Devices, Inc. All rights reserved
 Contents
 General 64-Bit Optimizations
 Chapter Cache and Memory Optimizations
 Chapter Integer Optimizations 159
 Chapter X87 Floating-Point Optimizations 237
 Viii
Appendix B Implementation of Write-Combining
 Index
 Software Optimization Guide for AMD64 Processors
 Tables
Tables
 Xii Tables
 Figures
 Xiv
 Revision History
 Xvi Revision History
 Getting Started Quickly
Intended Audience
 Using This Guide
 Numbering Systems
Special Information
Typographic Notation
Providing Feedback
 Important New Terms
Primitive Operations
Internal Instruction Formats
 Types of Instructions
Instructions, Macro-ops and Micro-ops
Mrom
 Guideline
Key Optimizations
Key Optimizations by Rank
Optimizations by Rank
 C++ Source-Level Optimizations
 C++ Source-Level Optimizations
 Optimization
Declarations of Floating-Point Values
Application
Rationale
 Using Arrays and Pointers
 Matrix
Example
 Instead, use the equivalent array notation
Additional Considerations
 Related Information
Unrolling Small Loops
 Expression Order in Compound Branch Conditions
 Chapter C++ Source-Level Optimizations
 Long Logical Expressions in If Statements
 Arrange Boolean Operands for Quick Expression Evaluation
 If *p == y && strlenp
 Dynamic Memory Allocation Consideration
 Listing 3. Avoid
Unnecessary Store-to-Load Dependencies
 Application
 Examples
Matching Store and Load Size
 Listing 6. Preferred
 C++ Source-Level Optimizations
 Switch and Noncontiguous Case Expressions
 Example
 Related Information
 Arranging Cases by Probability of Occurrence
 Use of Function Prototypes
 Use of const Type Qualifier
 Rationale and Examples
Generic Loop Hoisting
 Listing
 Chapter C++ Source-Level Optimizations
 Local Static Functions
 Explicit Parallelism in Code
 Listing 11. Preferred
 Extracting Common Subexpressions
 Listing 15. Example 2 Preferred
 Sorting and Padding C and C++ Structures
Sorting and Padding C and C++ Structures
 C++ Source-Level Optimizations
 Sorting Local Variables
 Related Information
 Replacing Integer Division with Multiplication
 Listing 16. Avoid
Frequently Dereferenced Pointer Arguments
 Listing 17. Preferred
 Array Indices
 Rational
23 32-Bit Integral Data Types
 Sign of Integer Operands
 Listing 20. Example 2 Avoid
 Accelerating Floating-Point Division and Square Root
 Examples
 Fast Floating-Point-to-Integer Conversion
 Listing 23. Slow
 Branches Dependent on Integer Comparisons Are Fast
Speeding Up Branches Based on Comparisons Between Floats
 Comparisons against Positive Constant
 Comparisons among Two Floats
 Improving Performance in Linux Libraries
 Software Optimization Guide for AMD64 Processors
 General 64-Bit Optimizations
 64-Bit Registers and Integer Arithmetic
This code performs 64-bit addition using 32-bit registers
 ESP+8ESP+4 = multiplicand
 Background
64-Bit Arithmetic and Large-Integer Multiplication
 G1 = c3 E1 + f1 + g0 = c2 D1 + e0 + f0 = c1 D0 = c0
 XMM
 END
 Text Segment
 128-Bit Media Instructions and Floating-Point Operations
 32-Bit Legacy GPRs and Small Unsigned Integers
 Chapter General 64-Bit Optimizations
 General 64-Bit Optimizations
 Instruction-Decoding Optimizations
 DirectPath Instructions
 Load-Execute Integer Instructions Optimization
Load-Execute Instructions
 Movss xmm0, floatvar1 mulss xmm0, floatvar2
 Application
 Branch Targets in Program Hot Spots
 32/64-Bit vs -Bit Forms of the LEA Instruction
 Take Advantage of x86 and AMD64 Complex Addressing Modes
 Cmpb %al,0x68e35%r10,%r13
 Short Instruction Encodings
 Partial-Register Reads and Writes
 Avoid
 Using Leave for Function Epilogues
Functions That Allocate Local Variables
Functions That Do not Allocate Local Variables
 Traditional function epilogue looks like this
 Alternatives to Shld Instruction
 Lea reg1, reg1*8+reg2
 10 8-Bit Sign-Extended Immediate Values
 11 8-Bit Sign-Extended Displacements
 NOP
Code Padding with Operand-Size Override
 Instruction-Decoding Optimizations
 Cache and Memory Optimizations
 Examples-Store-to-Load-Forwarding Stalls
Memory-Size Mismatches
Bit Avoid
Avoid
 Preferred If the Contents of MM0 are No Longer Needed
Examples-Large-to-small Mismatches
Preferred If Stores Are Close to the Load
 Preferred If the Stores and Loads are Close Together, Option
 Natural Alignment of Data Objects
 Cache-Coherent Nonuniform Memory Access ccNUMA
 CPU0
 OS Implications
Dual-Core AMD Opteron Processor Configuration
 Multiprocessor Considerations
 Store-to-Load Forwarding Pitfalls-True Dependencies
Store-to-Load Forwarding Restrictions
Narrow-to-Wide Store-Buffer Data-Forwarding Restriction
100
 Wide-to-Narrow Store-Buffer Data-Forwarding Restriction
Misaligned Store-Buffer Data-Forwarding Restriction
101
 One Supported Store-to-Load Forwarding Case
High-Byte Store-Buffer Data-Forwarding Restriction
Store-to-Load Forwarding-False Dependencies
102
 103
Summary of Store-to-Load-Forwarding Pitfalls to Avoid
 Prefetch Instructions
Prefetching versus Preloading
104
 Hardware Prefetching
Unit-Stride Access
PREFETCH/W versus PREFETCHNTA/T0/T1/T2
105
 Prefetchw versus Prefetch
Write-Combining Usage
106
 107
Multiple Prefetches
 108
Determining Prefetch Distance
 Processor-Limited Code
Memory-Limited Code
 Cache and Memory Optimizations
Definitions
 111
Prefetch at Least 64 Bytes Away from Surrounding Stores
 Streaming-Store/Non-Temporal Instructions
 113
Write-combining
 How to Know If a Bank Conflict Exists
L1 Data Cache Bank Conflicts
Fields Used to Address the Multibank L1 Data Cache
 115
 Placing Code and Data in the Same 64-Byte Cache Line
 117
Sorting and Padding C and C++ Structures
 118
 119
 Memory Copy
Copying Small Data Structures
120
 121
 Extend Arguments to 32 Bits Before Pushing onto Stack
Stack Considerations
Optimized Stack Usage
122
 123
Cache Issues when Writing Instruction Bytes to Memory
 124
Interleave Loads and Stores
 125
This Chapter
 126
Density of Branches
 Align
127
 128
Two-Byte Near-Return RET Instruction
 129
 Signed Integer ABS Function x = labsx
Branches That Depend on Random Data
Unsigned Integer min Function z = x y ? x y
130
 131
Conditional Write
 132
Pairing Call and Return
 133
Recursive Functions
 134
 135
Nonzero Code-Segment Base Values
 Replacing Branches with Computation
Muxing Constructs
136
 SSE Solution Preferred
MMX Solution Avoid
137
 Example 1 C Code
Sample Code Translated into AMD64 Code
Example 1 3DNow! Code
Example 2 C Code
 Example 3 3DNow! Code
Example 3 C Code
Example 4 C Code
Example 4 3DNow! Code
 Example 5 C Code
Example 5 3DNow! Code
140
 141
Loop Instruction
 142
Far Control-Transfer Instructions
 143
Chapter Scheduling Optimizations
 144
Instruction Scheduling by Latency
 Loop Unrolling
Loop Unrolling
145
 Example Complete Loop Unrolling
Complete Loop Unrolling
Partial Loop Unrolling
Example Partial Loop Unrolling
 147
Fadd
 148
Deriving the Loop Control for Partially Unrolled Loops
 149
Inline Functions
 150
Additional Recommendations
 Address-Generation Interlocks
Address-Generation Interlocks
151
 152
 153
Movzx and Movsx
 154
Pointer Arithmetic in Loops
 155
 156
 157
Pushing Memory Data Directly onto the Stack
 158
 159
Chapter Integer Optimizations
 Multiplication by Reciprocal Division Utility
Replacing Division with Multiplication
Signed Division Utility
Unsigned Division Utility
 Algorithm Divisors 1 = d 231, Odd d
Unsigned Division by Multiplication of Constant
Algorithm Divisors 231 = d
161
 Simpler Code for Restricted Dividend
Signed Division by Multiplication of Constant
Algorithm Divisors 2 = d
Signed Division by
 Signed Division by -2n
Signed Division by 2n
Remainder of Signed Division by 2 or
Remainder of Signed Division by 2n or -2n
 164
Alternative Code for Multiplying by a Constant
 165
 166
 Latency of Repeated String Instructions
Repeated String Instructions
Guidelines for Repeated String Instructions
167
 168
Use the Largest Possible Operand Size
 Using XOR to Clear Integer Registers
Acceptable
169
 Bit Addition
Efficient 64-Bit Integer Arithmetic in 32-Bit Mode
Bit Subtraction
Bit Negation
 Bit Multiplication
Bit Right Shift
Bit Unsigned Division
171
 172
 173
Bit Signed Division
 174
Bit Unsigned Remainder Computation
 175
 176
Bit Signed Remainder Computation
 177
 178
 179
 180
Integer Version
 Efficient Binary-to-ASCII Decimal Conversion
Binary-to-ASCII Decimal Conversion Retaining Leading Zeros
181
 182
 183
Binary-to-ASCII Decimal Conversion Suppressing Leading Zeros
 184
 185
 186
Unsigned Integer Division
 187
 188
 Signed Integer Division
Example Code
189
 190
 191
 192
Optimizing Integer Division
 193
Optimizing with Simd Instructions
 194
 195
Ensure All Packed Floating-Point Data are Aligned
 196
Rationale-Single Precision
 197
Rational-Double Precision
 198
Use MOVLPx/MOVHPx Instructions for Unaligned Data Access
 199
Use Movapd and Movaps Instead of Movupd and Movups
 Loop type Description
200
 201
 202
Double-Precision 32 ⋅ 32 Matrix Multiplication
 203
 204
 205
 206
 207
 208
Passing Data between MMX and 3DNow! Instructions
 209
Storing Floating-Point Data in MMX Registers
 210
Emms and Femms Usage
 XMM Text Segment
211
 212
 213
 214
 Single Precision
Double Precision
215
 216
Clearing MMX and XMM Registers with XOR Instructions
 217
 218
 219
 220
 221
 Code below Puts the Floating Point Sign Mask
222
 223
 224
 225
 226
 Pfpnacc
227
 228
 229
 230
 Listing 27 ⋅ 4 Matrix Multiplication SSE
231
 XMM3
232
 Listing 28 ⋅ 4 Matrix Multiplication 3DNow! Technology
233
 234
 235
 236
 237
X87 Floating-Point Optimizations
 238
Using Multiplication Rather Than Division
 239
Achieving Two Floating-Point Operations per Clock Cycle
 240
 241
 242
Align and Pack DirectPath x87 Instructions
 243
 244
Floating-Point Compare Instructions
 245
Using the Fxch Instruction Rather Than FST/FLD Pairs
 246
Floating-Point Subexpression Elimination
 247
Accumulating Precision-Sensitive Quantities in x87 Registers
 248
Avoiding Extended-Precision Data
 249
 Key Microarchitecture Features
 Processor Block Diagram
Superscalar Processor
251
 L1 Instruction Cache
AMD Athlon 64 and AMD Opteron Processors Block Diagram
 L1 Instruction Cache Specifications by Processor
Branch-Prediction Table
253
 Fetch-Decode Unit
L1 Instruction TLB Specifications
Instruction Control Unit
Translation-Lookaside Buffer
 L1 Data TLB Specifications
L2 TLB Specifications
10 L1 Data Cache
 L1 Data Cache Specifications by Processor
Integer Scheduler
Integer Execution Unit
 257
Floating-Point Scheduler
 Floating-Point Execution Unit
Load-Store Unit
Floating-Point Unit
 259
16 L2 Cache
 Buses for AMD Athlon 64 and AMD Opteron Processor
Integrated Memory Controller
HyperTransport Technology Interface
 261
HyperTransport Technology
 Software Optimization Guide for AMD64 Processors
 263
Write-Combining Definitions and Abbreviations
 Programming Details
Write-combining Operations
264
 265
Write-Combining Completion Events
 266
Sending Write-Buffer Data to the System
 267
Optimizations
 268
 269
Appendix C Instruction Latencies
 Example Instruction Entry
Understanding Instruction Entries
Parts of the Instruction Entry
270
 271
Interpreting Placeholders
 272
Interpreting Latencies
 Integer Instructions
Integer Instructions
273
AAA
 ADD reg16/32/64, mem16/32/64
274
 Bswap EAX/RAX/R8
275
 276
 277
 Mem16/32/64 CMOVNP/CMOVPO reg16/32/64, reg16/32/64
278
 CMP reg16/32/64, mreg16/32/64
279
 280
 281
 282
 283
 JA/JNBE disp16/32
284
 Lahf
285
 286
 Mfence
287
 288
 NOP Xchg EAX, EAX
289
 290
 291
 292
 Rdmsr
293
Rdpmc
Rdtsc
 Sahf
294
 SBB reg16/32/64, mem16/32/64
295
 296
 297
 298
 STC
299
STD
STI
 Syscall
300
Sysenter
Sysexit
 301
 302
Xchg AX/EAX/RAX, SI/ESI/RSI/R14
Xchg AX/EAX/RAX, DI/EDI/RDI/R15
 MMX Technology Instructions
MMX Technology Instructions
303
 Fmul
304
 305
 306
 X87 Floating-Point Instructions
X87 Floating-Point Instructions
307
 Fadd Fcompp
308
Fadd Fcos
Fdecstp
 309
Fincstp
FADD/FMUL Fstore Finit
 310
 311
 312
 Fxch
313
Fxtract
FYL2X
 3DNow! Technology Instructions
3DNow! Technology Instructions
314
Femms
 315
DNow! Technology Instructions
 3DNow! Technology Extensions
3DNow! Technology Extensions
316
 SSE Instructions
SSE Instructions
317
 318
 319
 320
 321
 Mem64 Prefetchnta mem8 0Fh 18h Mm-000-xxx DirectPath
322
 Sfence
323
 324
 325
 SSE2 Instructions
SSE2 Instructions
326
 327
0FH
Fmul Fstore
 328
 Maskmovdqu
329
 Fadd Fmul Fstore
330
 331
 332
 Fadd Fmul
333
 334
 335
 336
 337
 338
 339
Fmul Punpckhbw
Fmul Punpckhdq
 340
 341
 SSE3 Instructions
SSE3 Instructions
342
 343
 344
 345
Fast-Write Optimizations
 346
Fast-Write Optimizations for Graphics-Engine Programming
 347
Cacheable-Memory Command Structure
 348
 349
Fast-Write Optimizations for Video-Memory Copies
 350
 351
Memory Optimizations
 352
Northbridge Command Flow
 Optimizations for Texture-Map Copies to AGP Memory
Optimizations for Vertex-Geometry Copies to AGP Memory
353
 354
 Types of XMM-Register Data
Types of SSE and SSE2 Instructions
355
 356
Half-Register Operations
 Zeroing Out an XMM Register
Clearing XMM Registers
357
 358
 359
Reuse of Dead Registers
 360
Moving Data Between XMM Registers and GPRs
 361
Saving and Restoring Registers of Unknown Format
 362
SSE and SSE2 Copy Loops
 363
Explicit Load Instructions
 Data Conversion
Converting Scalar Values
364
 Converting Directly from Memory
Converting Vector Values
365
INT GPR FPS
 366
 367
Numerics
 368 Index