Contents
Tables
Figures
Revision History
Chapter 1 Introduction
1.1 Intended Audience
1.2 Getting Started Quickly
1.3 Using This Guide
Special Information
Numbering Systems
Typographic Notation
Providing Feedback
1.4 Important New Terms
Primitive Operations
Internal Instruction Formats
Types of Instructions
1.5 Key Optimizations
Guideline
Key Optimizations by Rank
Chapter 2 C and C++ Source-Level Optimizations
2.1 Declarations of Floating-Point Values
2.2 Using Arrays and Pointers
Additional Considerations
2.3 Unrolling Small Loops
2.4 Expression Order in Compound Branch Conditions
2.5 Long Logical Expressions in If Statements
2.6 Arrange Boolean Operands for Quick Expression Evaluation
2.7 Dynamic Memory Allocation Consideration
2.8 Unnecessary Store-to-Load Dependencies
2.9 Matching Store and Load Size
Listing 6. Preferred
2.10 SWITCH and Noncontiguous Case Expressions
2.11 Arranging Cases by Probability of Occurrence
2.12 Use of Function Prototypes
2.13 Use of const Type Qualifier
2.14 Generic Loop Hoisting
Rationale and Examples
2.15 Local Static Functions
2.16 Explicit Parallelism in Code
Rationale and Examples
2.17 Extracting Common Subexpressions
2.18 Sorting and Padding C and C++ Structures
2.19 Sorting Local Variables
2.20 Replacing Integer Division with Multiplication
2.21 Frequently Dereferenced Pointer Arguments
Listing 17. Preferred
2.22 Array Indices
2.23 32-Bit Integral Data Types
2.24 Sign of Integer Operands
2.25 Accelerating Floating-Point Division and Square Root
Listing 22.
2.26 Fast Floating-Point-to-Integer Conversion
2.27 Speeding Up Branches Based on Comparisons Between Floats
2.28 Improving Performance in Linux Libraries
Chapter 3 General 64-Bit Optimizations
3.1 64-Bit Registers and Integer Arithmetic
3.2 64-Bit Arithmetic and Large-Integer Multiplication
Background
3.4 32-Bit Legacy GPRs and Small Unsigned Integers
Caution
Chapter 4 Instruction-Decoding Optimizations
4.1 DirectPath Instructions
4.2 Load-Execute Instructions
4.2.1 Load-Execute Integer Instructions
4.3 Branch Targets in Program Hot Spots
4.4 32/64-Bit vs. 16-Bit Forms of the LEA Instruction
4.5 Take Advantage of x86 and AMD64 Complex Addressing Modes
4.6 Short Instruction Encodings
4.7 Partial-Register Reads and Writes
4.8 Using LEAVE for Function Epilogues
Background
4.9 Alternatives to SHLD Instruction
4.10 8-Bit Sign-Extended Immediate Values
4.11 8-Bit Sign-Extended Displacements
4.12 Code Padding with Operand-Size Override and NOP
Chapter 5 Cache and Memory Optimizations
5.1 Memory-Size Mismatches
Examples: Store-to-Load-Forwarding Stalls
Examples: Large-to-Small Mismatches
5.2 Natural Alignment of Data Objects
5.3 Cache-Coherent Nonuniform Memory Access (ccNUMA)
OS Implications
5.4 Multiprocessor Considerations
5.5 Store-to-Load Forwarding Restrictions
Store-to-Load Forwarding Pitfalls: True Dependencies
Narrow-to-Wide Store-Buffer Data-Forwarding Restriction
Wide-to-Narrow Store-Buffer Data-Forwarding Restriction
Misaligned Store-Buffer Data-Forwarding Restriction
High-Byte Store-Buffer Data-Forwarding Restriction
One Supported Store-to-Load Forwarding Case
Store-to-Load Forwarding: False Dependencies
Summary of Store-to-Load-Forwarding Pitfalls to Avoid
5.6 Prefetch Instructions
25112 Rev. 3.06 September 2005
Figure 5. Processor-Limited Code
5.7 Streaming-Store/Non-Temporal Instructions
5.8 Write-combining
5.9 L1 Data Cache Bank Conflicts
Fields Used to Address the Multibank L1 Data Cache
How to Know If a Bank Conflict Exists
5.10 Placing Code and Data in the Same 64-Byte Cache Line
5.11 Sorting and Padding C and C++ Structures
5.12 Sorting Local Variables
5.13 Memory Copy
5.14 Stack Considerations
5.15 Cache Issues when Writing Instruction Bytes to Memory
5.16 Interleave Loads and Stores
Chapter 6 Branch Optimizations
In This Chapter
6.1 Density of Branches
6.2 Two-Byte Near-Return RET Instruction
6.3 Branches That Depend on Random Data
Conditional Write
6.4 Pairing CALL and RETURN
6.5 Recursive Functions
6.6 Nonzero Code-Segment Base Values
6.7 Replacing Branches with Computation
Muxing Constructs
SSE Solution (Preferred)
MMX Solution (Avoid)
Sample Code Translated into AMD64 Code
Example 3: C Code
Example 3: 3DNow! Code
Example 4: C Code
Example 4: 3DNow! Code
Example 5: C Code
Example 5: 3DNow! Code
6.8 The LOOP Instruction
6.9 Far Control-Transfer Instructions
Chapter 7 Scheduling Optimizations
7.1 Instruction Scheduling by Latency
7.2 Loop Unrolling
Complete Loop Unrolling
Example: Complete Loop Unrolling
Partial Loop Unrolling
Example: Partial Loop Unrolling
Deriving the Loop Control for Partially Unrolled Loops
7.3 Inline Functions
Additional Recommendations
7.4 Address-Generation Interlocks
7.5 MOVZX and MOVSX
7.6 Pointer Arithmetic in Loops
7.7 Pushing Memory Data Directly onto the Stack
Chapter 8 Integer Optimizations
8.1 Replacing Division with Multiplication
Multiplication by Reciprocal (Division) Utility
Unsigned Division by Multiplication of Constant
Signed Division by Multiplication of Constant
Signed Division by 2
Signed Division by 2^n
Signed Division by -(2^n)
Remainder of Signed Division by 2 or -2
8.2 Alternative Code for Multiplying by a Constant
8.3 Repeated String Instructions
Rationale
8.4 Using XOR to Clear Integer Registers
8.5 Efficient 64-Bit Integer Arithmetic in 32-Bit Mode
64-Bit Right Shift
64-Bit Multiplication
64-Bit Unsigned Division
64-Bit Signed Division
64-Bit Unsigned Remainder Computation
64-Bit Signed Remainder Computation
8.6 Efficient Implementation of Population-Count Function in 32-Bit Mode
Integer Version
8.7 Efficient Binary-to-ASCII Decimal Conversion
Binary-to-ASCII Decimal Conversion Retaining Leading Zeros
Binary-to-ASCII Decimal Conversion Suppressing Leading Zeros
8.8 Derivation of Algorithm, Multiplier, and Shift Factor for Integer Division by Constants
Unsigned Integer Division
Signed Integer Division
8.9 Optimizing Integer Division
Chapter 9 Optimizing with SIMD Instructions
9.1 Ensure All Packed Floating-Point Data are Aligned
Rationale: Single Precision
Rationale: Double Precision
9.3 Use MOVLPx/MOVHPx Instructions for Unaligned Data Access
9.4 Use MOVAPD and MOVAPS Instead of MOVUPD and MOVUPS
9.5 Structuring Code with Prefetch Instructions to Hide Memory Latency
9.6 Avoid Moving Data Directly Between General-Purpose and MMX Registers
9.7 Use MMX Instructions to Construct Fast Block-Copy Routines in 32-Bit Mode
9.8 Passing Data between MMX and 3DNow!
9.9 Storing Floating-Point Data in MMX Registers
9.10 EMMS and FEMMS Usage
9.11 Using SIMD Instructions for Fast Square Roots and Fast Reciprocal Square Roots
9.12 Use XOR Operations to Negate Operands of SSE, SSE2, and 3DNow! Instructions
9.13 Clearing MMX and XMM Registers with XOR
9.14 Finding the Floating-Point Absolute Value of Operands of SSE, SSE2, and 3DNow!
9.15 Accumulating Single-Precision Floating-Point Numbers Using SSE, SSE2, and 3DNow!
9.16 Complex-Number Arithmetic Using SSE, SSE2, and 3DNow! Instructions
Listing 25. Complex Multiplication of Streams of Complex Numbers (SSE)
Listing 26. Complex Multiplication of Streams of Complex Numbers (3DNow! Technology)
9.17 Optimized 4 × 4 Matrix Multiplication on 4 × 1 Column Vector Routines
Listing 27. 4 × 4 Matrix Multiplication (SSE)
Listing 28. 4 × 4 Matrix Multiplication (3DNow! Technology)
Chapter 10 x87 Floating-Point Optimizations
10.1 Using Multiplication Rather Than Division
10.2 Achieving Two Floating-Point Operations per Clock Cycle
10.3 Floating-Point Compare Instructions
10.4 Using the FXCH Instruction Rather Than FST/FLD Pairs
10.5 Floating-Point Subexpression Elimination
10.6 Accumulating Precision-Sensitive Quantities in x87 Registers
10.7 Avoiding Extended-Precision Data
Appendix A Microarchitecture for AMD Athlon 64 and AMD Opteron Processors
A.1 Key Microarchitecture Features
A.5 L1 Instruction Cache
A.6 Branch-Prediction Table
A.7 Fetch-Decode Unit
A.8 Instruction Control Unit
A.9 Translation-Lookaside Buffer
L1 Instruction TLB Specifications
L1 Data TLB Specifications
A.10 L1 Data Cache
A.11 Integer Scheduler
A.12 Integer Execution Unit
A.13 Floating-Point Scheduler
A.14 Floating-Point Execution Unit
A.15 Load-Store Unit
A.16 L2 Cache
A.17 Write-combining
A.18 Buses for AMD Athlon 64 and AMD Opteron Processor
A.19 Integrated Memory Controller
A.20 HyperTransport Technology Interface
HyperTransport Technology
Appendix B Implementation of Write-Combining
B.1 Write-Combining Definitions and Abbreviations
B.2 Programming Details
B.3 Write-combining Operations
B.4 Sending Write-Buffer Data to the System
B.5 Write-Combining Optimization on Revision D and E AMD Athlon 64 and AMD Opteron Processors
Optimizations
Appendix C Instruction Latencies
C.1 Understanding Instruction Entries
Example: Instruction Entry
Parts of the Instruction Entry
Interpreting Placeholders
Interpreting Latencies
C.2 Integer Instructions
Table 13. Integer Instructions
C.3 MMX Technology Instructions
Table 14. MMX Technology Instructions
C.4 x87 Floating-Point Instructions
Table 15. x87 Floating-Point Instructions
C.5 3DNow! Technology Instructions
Table 16. 3DNow! Technology Instructions
C.6 3DNow! Technology Extensions
Table 17. 3DNow! Technology Extensions
C.7 SSE Instructions
Table 18. SSE Instructions
C.8 SSE2 Instructions
Table 19. SSE2 Instructions
C.9 SSE3 Instructions
Table 20. SSE3 Instructions
Appendix D AGP Considerations
D.1 Fast-Write Optimizations
D.2 Fast-Write Optimizations for Graphics-Engine Programming
D.3 Fast-Write Optimizations for Video-Memory Copies
D.4 Memory Optimizations
Figure 12. Northbridge Command Flow
D.5 Memory Optimizations for Graphics-Engine Programming Using the DMA Model
D.6 Optimizations for Texture-Map Copies to AGP
D.7 Optimizations for Vertex-Geometry Copies to AGP
Appendix E SSE and SSE2 Optimizations
Types of XMM-Register Data
Types of SSE and SSE2 Instructions
E.1 Half-Register Operations
E.2 Zeroing Out an XMM Register
E.3 Reuse of Dead Registers
E.4 Moving Data Between XMM Registers and GPRs
E.5 Saving and Restoring Registers of Unknown Format
E.6 SSE and SSE2 Copy Loops
E.7 Explicit Load Instructions
E.8 Data Conversion
Table 22. Converting Scalar Values
Table 23. Converting Vector Values
Table 24. Converting Directly from Memory
Index