Software Optimization Guide For AMD64 Processors
© 2001-2005 Advanced Micro Devices, Inc. All rights reserved.
Contents
General 64-Bit Optimizations
Cache and Memory Optimizations
Integer Optimizations
x87 Floating-Point Optimizations
Appendix B Implementation of Write-Combining
Index
Tables
Figures
Revision History
Getting Started Quickly
Intended Audience
Using This Guide
Numbering Systems
Special Information
Typographic Notation
Providing Feedback
Internal Instruction Formats
Important New Terms
Primitive Operations
MROM
Types of Instructions
Instructions, Macro-ops and Micro-ops
Guideline
Key Optimizations
Key Optimizations by Rank
Optimizations by Rank
C++ Source-Level Optimizations
Optimization
Declarations of Floating-Point Values
Application
Rationale
Using Arrays and Pointers
Matrix
Example
Instead, use the equivalent array notation
Additional Considerations
Related Information
Unrolling Small Loops
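The original example listing is not reproduced here; the following is a minimal C sketch of the idea, with the array names and the trip count of 3 chosen only for illustration:

/* Rolled form: for a trip count this small, the increment, compare, and
   branch cost more than the three additions themselves. */
void add_rolled(double x[3], const double y[3])
{
    for (int i = 0; i < 3; i++)
        x[i] += y[i];
}

/* Completely unrolled form: no induction variable and no branches. */
void add_unrolled(double x[3], const double y[3])
{
    x[0] += y[0];
    x[1] += y[1];
    x[2] += y[2];
}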
Expression Order in Compound Branch Conditions
Long Logical Expressions in If Statements
Arrange Boolean Operands for Quick Expression Evaluation
if (*p == y && strlen(p) ...)
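Only the fragment of the condition above survives. As a hedged illustration of the guideline (the struct and field names are hypothetical), place the cheaper and more often false operand first so short-circuit evaluation usually skips the expensive call:

#include <string.h>

struct rec {
    int flag;            /* cheap test, usually false */
    const char *name;
};

static int wanted(const struct rec *r)
{
    /* The inexpensive flag test runs first; strlen() is called only
       when the flag is set. */
    return r->flag && strlen(r->name) > 4;
}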
Dynamic Memory Allocation Consideration
Listing 3. Avoid
Unnecessary Store-to-Load Dependencies
Application
Examples
Matching Store and Load Size
Listing 6. Preferred
Switch and Noncontiguous Case Expressions
Example
Related Information
Arranging Cases by Probability of Occurrence
Use of Function Prototypes
Use of const Type Qualifier
Rationale and Examples
Generic Loop Hoisting
Listing
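The listing itself was lost in extraction; a minimal C sketch of the transformation (function and variable names are illustrative): test the loop-invariant condition once, outside the loop, at the cost of duplicating the loop body.

/* Before: the invariant condition is re-evaluated on every iteration. */
void scale_or_shift(int *a, int n, int use_scale, int k)
{
    for (int i = 0; i < n; i++) {
        if (use_scale)
            a[i] *= k;
        else
            a[i] += k;
    }
}

/* After: the condition is hoisted; each loop body is branch-free. */
void scale_or_shift_hoisted(int *a, int n, int use_scale, int k)
{
    if (use_scale) {
        for (int i = 0; i < n; i++) a[i] *= k;
    } else {
        for (int i = 0; i < n; i++) a[i] += k;
    }
}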
Local Static Functions
Explicit Parallelism in Code
Listing 11. Preferred
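The preferred listing referenced above is not reproduced; the sketch below shows the underlying idea in C (names and the unroll factor of two are assumptions): split one serial dependency chain into independent accumulators so successive floating-point additions can overlap. Note that this reorders the additions, which can change rounding slightly.

double sum(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0;   /* two independent dependency chains */
    int i;

    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)                   /* odd element count */
        s0 += a[i];
    return s0 + s1;
}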
Extracting Common Subexpressions
Listing 15. Example 2 Preferred
Sorting and Padding C and C++ Structures
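A minimal C sketch of the guideline (member names and the 64-bit sizes noted in the comments are assumptions, not taken from the original): declare members largest first so the compiler inserts no interior padding, and pad the tail to a multiple of the largest member so arrays of the structure stay aligned.

struct sorted_padded {
    double d;      /* 8 bytes */
    void  *p;      /* 8 bytes on a 64-bit target */
    int    i;      /* 4 bytes */
    short  s;      /* 2 bytes */
    char   c;      /* 1 byte  */
    char   pad[1]; /* explicit tail padding: total size 24, a multiple of 8 */
};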
Sorting Local Variables
Related Information
Replacing Integer Division with Multiplication
Listing 16. Avoid
Frequently Dereferenced Pointer Arguments
Listing 17. Preferred
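The preferred listing is not reproduced above; the following hedged C sketch shows the idea (the function and names are hypothetical): copy a frequently dereferenced pointer argument into a local so the compiler can keep the value in a register instead of reloading it around every store through a potentially aliasing pointer.

void axpy(float *y, const float *x, const float *a, int n)
{
    float av = *a;               /* dereference the argument once */
    for (int i = 0; i < n; i++)
        y[i] += av * x[i];       /* no reload of *a inside the loop */
}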
Array Indices
Rationale
32-Bit Integral Data Types
Sign of Integer Operands
Listing 20. Example 2 Avoid
Accelerating Floating-Point Division and Square Root
Examples
Fast Floating-Point-to-Integer Conversion
Listing 23. Slow
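Only the "Slow" caption survives above. As one hedged alternative (not necessarily the guide's exact listing), the SSE intrinsic below performs a truncating conversion with CVTTSS2SI and avoids toggling the x87 rounding-control field around FIST:

#include <xmmintrin.h>

static int float_to_int_trunc(float x)
{
    /* CVTTSS2SI truncates toward zero regardless of the x87 control word. */
    return _mm_cvtt_ss2si(_mm_set_ss(x));
}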
Branches Dependent on Integer Comparisons Are Fast
Speeding Up Branches Based on Comparisons Between Floats
Comparisons against Positive Constant
Comparisons among Two Floats
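A hedged C sketch of the technique (the helper name is hypothetical): for finite, non-negative IEEE-754 single-precision values, the unsigned ordering of the raw bit patterns matches the numeric ordering, so an integer compare can stand in for a floating-point compare. General code must handle negative values separately, as the guide's full example does.

#include <stdint.h>
#include <string.h>

static int fless_nonneg(float a, float b)   /* requires a >= 0 and b >= 0 */
{
    uint32_t ia, ib;
    memcpy(&ia, &a, sizeof ia);             /* well-defined type punning */
    memcpy(&ib, &b, sizeof ib);
    return ia < ib;                         /* integer compare, no x87/SSE compare */
}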
Improving Performance in Linux Libraries
General 64-Bit Optimizations
64-Bit Registers and Integer Arithmetic
This code performs 64-bit addition using 32-bit registers
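The assembly listing itself was lost in extraction. A C sketch of what the ADD/ADC pair computes (function and parameter names are illustrative): add the low halves, derive the carry, and fold it into the sum of the high halves.

#include <stdint.h>

static void add64(uint32_t alo, uint32_t ahi,
                  uint32_t blo, uint32_t bhi,
                  uint32_t *rlo, uint32_t *rhi)
{
    uint32_t lo = alo + blo;         /* ADD: low dwords */
    uint32_t carry = (lo < alo);     /* carry out of the low-word add */
    *rlo = lo;
    *rhi = ahi + bhi + carry;        /* ADC: high dwords plus carry */
}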
[ESP+8]:[ESP+4] = multiplicand
Background
64-Bit Arithmetic and Large-Integer Multiplication
g1 = c3;  e1 + f1 + g0 = c2;  d1 + e0 + f0 = c1;  d0 = c0
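The column sums above describe how a 64 x 64 -> 128-bit product is assembled from the four 32 x 32 -> 64-bit partial products d, e, f, and g. A hedged C sketch (the helper name is hypothetical) that follows those columns, with the carries propagated explicitly:

#include <stdint.h>

static void mul64x64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint32_t a0 = (uint32_t)a, a1 = (uint32_t)(a >> 32);
    uint32_t b0 = (uint32_t)b, b1 = (uint32_t)(b >> 32);

    uint64_t d = (uint64_t)a0 * b0;   /* d1:d0 */
    uint64_t e = (uint64_t)a1 * b0;   /* e1:e0 */
    uint64_t f = (uint64_t)a0 * b1;   /* f1:f0 */
    uint64_t g = (uint64_t)a1 * b1;   /* g1:g0 */

    uint64_t c1 = (d >> 32) + (uint32_t)e + (uint32_t)f;            /* d1+e0+f0 */
    uint64_t c2 = (e >> 32) + (f >> 32) + (uint32_t)g + (c1 >> 32); /* e1+f1+g0 + carry */
    uint64_t c3 = (g >> 32) + (c2 >> 32);                           /* g1 + carry */

    *lo = (uint32_t)d | (c1 << 32);
    *hi = (uint32_t)c2 | (c3 << 32);
}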
128-Bit Media Instructions and Floating-Point Operations
32-Bit Legacy GPRs and Small Unsigned Integers
Instruction-Decoding Optimizations
DirectPath Instructions
Load-Execute Integer Instructions Optimization
Load-Execute Instructions
movss xmm0, floatvar1
mulss xmm0, floatvar2
Application
Branch Targets in Program Hot Spots
32/64-Bit vs. 16-Bit Forms of the LEA Instruction
Take Advantage of x86 and AMD64 Complex Addressing Modes
cmpb %al, 0x68e35(%r10,%r13)
Short Instruction Encodings
Partial-Register Reads and Writes
Avoid
Functions That Do Not Allocate Local Variables
Using LEAVE for Function Epilogues
Functions That Allocate Local Variables
The traditional function epilogue looks like this:
Alternatives to SHLD Instruction
lea reg1, [reg1*8+reg2]
8-Bit Sign-Extended Immediate Values
8-Bit Sign-Extended Displacements
Code Padding with Operand-Size Override and NOP
Cache and Memory Optimizations
Examples-Store-to-Load-Forwarding Stalls
Memory-Size Mismatches
Bit Avoid
Avoid
Preferred If Stores Are Close to the Load
Preferred If the Contents of MM0 are No Longer Needed
Examples-Large-to-small Mismatches
Preferred If the Stores and Loads are Close Together, Option
Natural Alignment of Data Objects
Cache-Coherent Nonuniform Memory Access (ccNUMA)
OS Implications
Dual-Core AMD Opteron Processor Configuration
Multiprocessor Considerations
Store-to-Load Forwarding Pitfalls-True Dependencies
Store-to-Load Forwarding Restrictions
Narrow-to-Wide Store-Buffer Data-Forwarding Restriction
Wide-to-Narrow Store-Buffer Data-Forwarding Restriction
Misaligned Store-Buffer Data-Forwarding Restriction
One Supported Store-to-Load Forwarding Case
High-Byte Store-Buffer Data-Forwarding Restriction
Store-to-Load Forwarding-False Dependencies
Summary of Store-to-Load-Forwarding Pitfalls to Avoid
Prefetch Instructions
Prefetching versus Preloading
Hardware Prefetching
Unit-Stride Access
PREFETCH/W versus PREFETCHNTA/T0/T1/T2
PREFETCHW versus PREFETCH
Write-Combining Usage
Multiple Prefetches
Determining Prefetch Distance
Processor-Limited Code
Memory-Limited Code
Definitions
Prefetch at Least 64 Bytes Away from Surrounding Stores
Streaming-Store/Non-Temporal Instructions
Write-combining
Fields Used to Address the Multibank L1 Data Cache
How to Know If a Bank Conflict Exists
L1 Data Cache Bank Conflicts
Placing Code and Data in the Same 64-Byte Cache Line
Sorting and Padding C and C++ Structures
Memory Copy
Copying Small Data Structures
Extend Arguments to 32 Bits Before Pushing onto Stack
Stack Considerations
Optimized Stack Usage
Cache Issues when Writing Instruction Bytes to Memory
Interleave Loads and Stores
Branch Optimizations
Density of Branches
Two-Byte Near-Return RET Instruction
Signed Integer ABS Function (x = labs(x))
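A branch-free C equivalent of the technique (assumes two's-complement arithmetic and that >> of a negative int is an arithmetic shift, which holds for the compilers targeted by this guide; the result for INT_MIN is undefined, as with labs):

#include <stdint.h>

static int32_t abs32(int32_t x)
{
    int32_t mask = x >> 31;      /* 0 if x >= 0, -1 (all ones) if x < 0 */
    return (x ^ mask) - mask;    /* identity for x >= 0, negation for x < 0 */
}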
Branches That Depend on Random Data
Unsigned Integer Min Function (z = x < y ? x : y)
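A branch-free C sketch of the same selection (the function name is illustrative): build an all-ones or all-zeros mask from the comparison and use it to pick one operand without a conditional jump.

#include <stdint.h>

static uint32_t umin(uint32_t x, uint32_t y)
{
    uint32_t mask = 0u - (uint32_t)(x < y);   /* 0xFFFFFFFF if x < y, else 0 */
    return (x & mask) | (y & ~mask);
}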
Conditional Write
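A C sketch of the idea (the dummy "sink" location is an assumption for illustration): rather than branching around a store, always perform exactly one store and select the destination address, which compilers typically turn into a CMOV followed by an unconditional store.

static void cond_write(int cond, int *dest, int value, int *sink)
{
    int *p = cond ? dest : sink;   /* address selection instead of a branch */
    *p = value;                    /* the store always executes */
}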
Pairing Call and Return
Recursive Functions
Nonzero Code-Segment Base Values
Replacing Branches with Computation
Muxing Constructs
SSE Solution (Preferred)
MMX Solution (Avoid)
Example 1 C Code
Sample Code Translated into AMD64 Code
Example 1 3DNow! Code
Example 2 C Code
Example 3 3DNow! Code
Example 3 C Code
Example 4 C Code
Example 4 3DNow! Code
Example 5 C Code
Example 5 3DNow! Code
The LOOP Instruction
Far Control-Transfer Instructions
Scheduling Optimizations
Instruction Scheduling by Latency
Loop Unrolling
Complete Loop Unrolling
Example: Complete Loop Unrolling
Partial Loop Unrolling
Example: Partial Loop Unrolling
Deriving the Loop Control for Partially Unrolled Loops
Inline Functions
Additional Recommendations
Address-Generation Interlocks
MOVZX and MOVSX
Pointer Arithmetic in Loops
Pushing Memory Data Directly onto the Stack
Integer Optimizations
Multiplication by Reciprocal (Division) Utility
Replacing Division with Multiplication
Signed Division Utility
Unsigned Division Utility
Algorithm: Divisors 1 <= d < 2^31, Odd d
Unsigned Division by Multiplication of Constant
Algorithm: Divisors 2^31 <= d < 2^32
Simpler Code for Restricted Dividend
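As a hedged illustration of "Unsigned Division by Multiplication of Constant" above (this particular magic pair is a well-known constant for dividing by 10, not output copied from the guide's utility): multiply by a scaled reciprocal and shift, instead of issuing a DIV.

#include <stdint.h>

/* q = n / 10 for any 32-bit n: multiply by ceil(2^35 / 10) = 0xCCCCCCCD
   and take bits 35 and up of the 64-bit product. */
static uint32_t div10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}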
Signed Division by Multiplication of Constant
Algorithm: Divisors 2 <= d < 2^31
Signed Division by
Signed Division by -2^n
Signed Division by 2^n
Remainder of Signed Division by 2 or -2
Remainder of Signed Division by 2^n or -2^n
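A C sketch of the signed-division-by-2^n case (truncating toward zero; assumes two's complement and an arithmetic right shift of negative values): bias negative dividends by 2^n - 1 before shifting so the shift rounds toward zero instead of toward minus infinity.

#include <stdint.h>

static int32_t sdiv_pow2(int32_t x, unsigned n)   /* 1 <= n <= 30 */
{
    int32_t bias = (x >> 31) & ((1 << n) - 1);    /* 2^n - 1 only when x < 0 */
    return (x + bias) >> n;
}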
Alternative Code for Multiplying by a Constant
Repeated String Instructions
Latency of Repeated String Instructions
Guidelines for Repeated String Instructions
Use the Largest Possible Operand Size
Using XOR to Clear Integer Registers
Acceptable
Efficient 64-Bit Integer Arithmetic in 32-Bit Mode
64-Bit Addition
64-Bit Subtraction
64-Bit Negation
64-Bit Multiplication
64-Bit Right Shift
64-Bit Unsigned Division
64-Bit Signed Division
64-Bit Unsigned Remainder Computation
64-Bit Signed Remainder Computation
Integer Version
Efficient Binary-to-ASCII Decimal Conversion
Binary-to-ASCII Decimal Conversion Retaining Leading Zeros
Binary-to-ASCII Decimal Conversion Suppressing Leading Zeros
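The guide's listings build the digits without the DIV instruction; the portable C sketch below (function name illustrative) shows only the basic conversion with leading-zero suppression: digits are produced least-significant first and then emitted in reverse.

#include <stdint.h>

static char *utoa10(uint32_t n, char *out)
{
    char tmp[10];                 /* enough for 4294967295 */
    int i = 0;

    do {
        tmp[i++] = (char)('0' + n % 10);
        n /= 10;
    } while (n != 0);             /* at least one digit, no leading zeros */

    while (i > 0)
        *out++ = tmp[--i];
    *out = '\0';
    return out;                   /* points at the terminating NUL */
}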
Unsigned Integer Division
Signed Integer Division
Example Code
Optimizing Integer Division
Optimizing with SIMD Instructions
Ensure All Packed Floating-Point Data are Aligned
Rationale-Single Precision
Rationale-Double Precision
Use MOVLPx/MOVHPx Instructions for Unaligned Data Access
Use MOVAPD and MOVAPS Instead of MOVUPD and MOVUPS
Double-Precision 32 × 32 Matrix Multiplication
Passing Data between MMX and 3DNow! Instructions
Storing Floating-Point Data in MMX Registers
EMMS and FEMMS Usage
Single Precision
Double Precision
Clearing MMX and XMM Registers with XOR Instructions
The code below puts the floating-point sign mask
PFPNACC
Listing 27. 4 × 4 Matrix Multiplication (SSE)
Listing 28. 4 × 4 Matrix Multiplication (3DNow! Technology)
x87 Floating-Point Optimizations
Using Multiplication Rather Than Division
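A minimal C sketch of the guideline (names are illustrative): when many values are divided by the same divisor, compute the reciprocal once and multiply, trading n divides for one divide and n multiplies. Each result can differ by one ulp from dividing directly, so this is only appropriate when that is acceptable.

#include <stddef.h>

static void scale_all(double *x, size_t n, double d)
{
    double r = 1.0 / d;           /* one division */
    for (size_t i = 0; i < n; i++)
        x[i] *= r;                /* multiplies are far cheaper than divides */
}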
Achieving Two Floating-Point Operations per Clock Cycle
240
241
242
Align and Pack DirectPath x87 Instructions
Floating-Point Compare Instructions
Using the FXCH Instruction Rather Than FST/FLD Pairs
Floating-Point Subexpression Elimination
Accumulating Precision-Sensitive Quantities in x87 Registers
Avoiding Extended-Precision Data
Appendix A Microarchitecture for AMD Athlon 64 and AMD Opteron Processors
Key Microarchitecture Features
Processor Block Diagram
Superscalar Processor
L1 Instruction Cache
AMD Athlon 64 and AMD Opteron Processors Block Diagram
L1 Instruction Cache Specifications by Processor
Branch-Prediction Table
Fetch-Decode Unit
L1 Instruction TLB Specifications
Instruction Control Unit
Translation-Lookaside Buffer
L1 Data Cache
L1 Data TLB Specifications
L2 TLB Specifications
Integer Execution Unit
L1 Data Cache Specifications by Processor
Integer Scheduler
Floating-Point Scheduler
Floating-Point Unit
Floating-Point Execution Unit
Load-Store Unit
L2 Cache
HyperTransport Technology Interface
Buses for AMD Athlon 64 and AMD Opteron Processor
Integrated Memory Controller
HyperTransport Technology
Appendix B Implementation of Write-Combining
Write-Combining Definitions and Abbreviations
Programming Details
Write-combining Operations
Write-Combining Completion Events
Sending Write-Buffer Data to the System
Optimizations
Appendix C Instruction Latencies
Example Instruction Entry
Understanding Instruction Entries
Parts of the Instruction Entry
Interpreting Placeholders
Interpreting Latencies
Integer Instructions
AAA
ADD reg16/32/64, mem16/32/64
BSWAP EAX/RAX/R8
CMOVNP/CMOVPO reg16/32/64, reg16/32/64
CMP reg16/32/64, mreg16/32/64
JA/JNBE disp16/32
LAHF
MFENCE
NOP (XCHG EAX, EAX)
RDMSR
RDPMC
RDTSC
SAHF
SBB reg16/32/64, mem16/32/64
STC
STD
STI
SYSCALL
SYSENTER
SYSEXIT
XCHG AX/EAX/RAX, DI/EDI/RDI/R15
XCHG AX/EAX/RAX, SI/ESI/RSI/R14
MMX Technology Instructions
FMUL
X87 Floating-Point Instructions
FADD FCOMPP
FADD FCOS
FDECSTP
FADD/FMUL/FSTORE FINIT
FINCSTP
FXCH
FXTRACT
FYL2X
3DNow! Technology Instructions
FEMMS
3DNow! Technology Extensions
SSE Instructions
PREFETCHNTA mem8   0Fh 18h   mm-000-xxx   DirectPath
SFENCE
SSE2 Instructions
FMUL FSTORE
MASKMOVDQU
FADD FMUL FSTORE
FADD FMUL
FMUL PUNPCKHDQ
FMUL PUNPCKHBW
SSE3 Instructions
Fast-Write Optimizations
Fast-Write Optimizations for Graphics-Engine Programming
Cacheable-Memory Command Structure
Fast-Write Optimizations for Video-Memory Copies
Memory Optimizations
Northbridge Command Flow
Optimizations for Texture-Map Copies to AGP Memory
Optimizations for Vertex-Geometry Copies to AGP Memory
Types of XMM-Register Data
Types of SSE and SSE2 Instructions
Half-Register Operations
Zeroing Out an XMM Register
Clearing XMM Registers
Reuse of Dead Registers
Moving Data Between XMM Registers and GPRs
Saving and Restoring Registers of Unknown Format
SSE and SSE2 Copy Loops
Explicit Load Instructions
Data Conversion
Converting Scalar Values
Converting Directly from Memory
Converting Vector Values
Index
Numerics