Software Optimization Guide for AMD64 Processors
© 2001–2005 Advanced Micro Devices, Inc. All rights reserved.
Contents
General 64-Bit Optimizations
Chapter Cache and Memory Optimizations
Chapter Integer Optimizations
Chapter x87 Floating-Point Optimizations
Appendix B Implementation of Write-Combining
Index
Tables
Figures
Revision History
Intended Audience
Getting Started Quickly
Using This Guide
Typographic Notation
Special Information
Numbering Systems
Providing Feedback
Important New Terms
Primitive Operations
Internal Instruction Formats
Types of Instructions
Instructions, Macro-ops and Micro-ops
MROM
Key Optimizations by Rank
Key Optimizations
Guideline
Optimizations by Rank
C++ Source-Level Optimizations
Application
Declarations of Floating-Point Values
Optimization
Rationale
Using Arrays and Pointers
Example
Matrix
Additional Considerations
Instead, use the equivalent array notation
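As a hedged illustration of this advice, the two loops below are equivalent; the function and variable names are invented for the example and do not come from the guide. The array form states the access pattern directly, which many compilers find easier to analyze than manually incremented pointers.

    /* Pointer-style traversal: the compiler must reason about what p aliases. */
    void scale_with_pointer(double *a, double k, int n) {
        double *p = a;
        for (int i = 0; i < n; i++) {
            *p = *p * k;
            p++;
        }
    }

    /* Equivalent array notation: the indexing pattern is explicit. */
    void scale_with_array(double a[], double k, int n) {
        for (int i = 0; i < n; i++) {
            a[i] = a[i] * k;
        }
    }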
Unrolling Small Loops
Related Information
Expression Order in Compound Branch Conditions
Long Logical Expressions in If Statements
Arrange Boolean Operands for Quick Expression Evaluation
if (*p == y && strlen(p))
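A minimal sketch of the guideline, assuming (as the fragment above suggests) a cheap flag test paired with a more expensive strlen call; the names are hypothetical. Because && short-circuits left to right in C and C++, the inexpensive or most-often-false operand should come first.

    #include <string.h>

    /* Hypothetical predicate: 'flag' is a simple register test, strlen(p)
     * walks the string.  With the cheap test first, the expensive call is
     * skipped whenever 'flag' is zero. */
    int needs_processing(int flag, const char *p) {
        return flag && strlen(p) > 4;
    }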
Dynamic Memory Allocation Consideration
Unnecessary Store-to-Load Dependencies
Listing 3. Avoid
Application
Matching Store and Load Size
Examples
Listing 6. Preferred
Switch and Noncontiguous Case Expressions
Example
Related Information
Arranging Cases by Probability of Occurrence
Use of Function Prototypes
Use of const Type Qualifier
Generic Loop Hoisting
Rationale and Examples
Listing
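One common form of generic loop hoisting, sketched below with invented names: a test that is invariant across the loop is moved outside by specializing the loop body, so the branch is evaluated once instead of on every iteration.

    /* Before: the loop-invariant 'ascending' is tested n times.
     *   for (int i = 0; i < n; i++) {
     *       if (ascending) a[i] += step; else a[i] -= step;
     *   }
     * After: hoist the test and provide one loop per case. */
    void adjust(int *a, int n, int step, int ascending) {
        if (ascending) {
            for (int i = 0; i < n; i++) a[i] += step;
        } else {
            for (int i = 0; i < n; i++) a[i] -= step;
        }
    }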
Local Static Functions
Explicit Parallelism in Code
Listing 11. Preferred
Extracting Common Subexpressions
Listing 15. Example 2 Preferred
Sorting and Padding C and C++ Structures
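A small sketch of the idea, with invented structure names; the sizes in the comments assume a typical AMD64 ABI. Declaring members in decreasing order of size removes interior padding that the compiler would otherwise insert.

    #include <stdint.h>

    struct unsorted_record {  /* typically 24 bytes: 7 bytes of padding after 'c',
                                 2 more after 's' */
        char    c;
        double  d;
        int16_t s;
        int32_t i;
    };

    struct sorted_record {    /* typically 16 bytes: members sorted largest first,
                                 only 1 byte of tail padding remains */
        double  d;
        int32_t i;
        int16_t s;
        char    c;
    };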
Sorting Local Variables
Related Information
Replacing Integer Division with Multiplication
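A hedged sketch of one source-level form of this optimization: when an expression contains two divisions, algebra can replace one of them with a multiplication. This is only safe when the combined divisor cannot overflow and both divisors are nonzero.

    /* Two dependent integer divisions. */
    int quotient_slow(int a, int b, int c) { return (a / b) / c; }

    /* One division and one multiplication; assumes b * c does not overflow. */
    int quotient_fast(int a, int b, int c) { return a / (b * c); }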
Frequently Dereferenced Pointer Arguments
Listing 16. Avoid
Listing 17. Preferred
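A minimal sketch of the recommendation on frequently dereferenced pointer arguments, using invented names: fields reached through a pointer argument are copied into locals once, so the compiler can keep them in registers instead of reloading them after every store that might alias the pointed-to object.

    typedef struct { double scale, offset; } params_t;  /* hypothetical type */

    void transform(const params_t *p, double *out, const double *in, int n) {
        /* Copy the frequently used fields once. */
        double scale  = p->scale;
        double offset = p->offset;
        for (int i = 0; i < n; i++) {
            /* Without the locals, stores through 'out' could force the
             * compiler to reload p->scale and p->offset each iteration. */
            out[i] = in[i] * scale + offset;
        }
    }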
Array Indices
32-Bit Integral Data Types
Rationale
Sign of Integer Operands
Listing 20. Example 2 Avoid
Accelerating Floating-Point Division and Square Root
Examples
Fast Floating-Point-to-Integer Conversion
Listing 23. Slow
Speeding Up Branches Based on Comparisons Between Floats
Branches Dependent on Integer Comparisons Are Fast
Comparisons against Positive Constant
Comparisons between Two Floats
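A sketch of the underlying idea for the comparison against a positive constant (here zero): the IEEE-754 bit pattern of a float can be compared as a signed integer, avoiding a floating-point compare and the transfer of condition codes. This assumes the value is not a NaN.

    #include <stdint.h>
    #include <string.h>

    /* True when f > 0.0f, decided with an integer compare on the bit pattern.
     * Negative values have the sign bit set and compare as negative integers;
     * +0.0 compares equal to zero.  NaN inputs are assumed not to occur. */
    int float_is_positive(float f) {
        int32_t bits;
        memcpy(&bits, &f, sizeof bits);   /* well-defined way to reinterpret */
        return bits > 0;
    }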
Improving Performance in Linux Libraries
General 64-Bit Optimizations
This code performs 64-bit addition using 32-bit registers
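The assembly example referred to here uses an ADD/ADC pair. A C sketch of the same idea, with invented names, operating on a 64-bit value split into 32-bit halves:

    #include <stdint.h>

    /* Add {a_hi:a_lo} + {b_hi:b_lo}, propagating the carry out of the low
     * 32-bit addition into the high half, as ADD/ADC does in registers. */
    void add64(uint32_t a_hi, uint32_t a_lo,
               uint32_t b_hi, uint32_t b_lo,
               uint32_t *r_hi, uint32_t *r_lo) {
        uint32_t lo    = a_lo + b_lo;
        uint32_t carry = (lo < a_lo);   /* 1 if the low addition wrapped */
        *r_lo = lo;
        *r_hi = a_hi + b_hi + carry;
    }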
64-Bit Registers and Integer Arithmetic
[ESP+8]:[ESP+4] = multiplicand
64-Bit Arithmetic and Large-Integer Multiplication
Background
g1 = c3, e1 + f1 + g0 = c2, d1 + e0 + f0 = c1, d0 = c0
128-Bit Media Instructions and Floating-Point Operations
32-Bit Legacy GPRs and Small Unsigned Integers
Instruction-Decoding Optimizations
DirectPath Instructions
Load-Execute Instructions
Load-Execute Integer Instructions Optimization
movss xmm0, [float_var1]
mulss xmm0, [float_var2]
Application
Branch Targets in Program Hot Spots
32/64-Bit vs. 16-Bit Forms of the LEA Instruction
Take Advantage of x86 and AMD64 Complex Addressing Modes
cmpb %al, 0x68e35(%r10,%r13)
Short Instruction Encodings
Partial-Register Reads and Writes
Avoid
Using LEAVE for Function Epilogues
Functions That Allocate Local Variables
Functions That Do Not Allocate Local Variables
The traditional function epilogue looks like this:
Alternatives to the SHLD Instruction
lea reg1, [reg1*8+reg2]
8-Bit Sign-Extended Immediate Values
8-Bit Sign-Extended Displacements
Code Padding with Operand-Size Override
NOP
Cache and Memory Optimizations
Memory-Size Mismatches
Examples: Store-to-Load-Forwarding Stalls
Avoid
Preferred If the Contents of MM0 are No Longer Needed
Examples: Large-to-Small Mismatches
Preferred If Stores Are Close to the Load
Preferred If the Stores and Loads are Close Together, Option
Natural Alignment of Data Objects
Cache-Coherent Nonuniform Memory Access (ccNUMA)
Dual-Core AMD Opteron Processor Configuration
OS Implications
Multiprocessor Considerations
Narrow-to-Wide Store-Buffer Data-Forwarding Restriction
Store-to-Load Forwarding Restrictions
Store-to-Load Forwarding Pitfalls: True Dependencies
Wide-to-Narrow Store-Buffer Data-Forwarding Restriction
Misaligned Store-Buffer Data-Forwarding Restriction
Store-to-Load Forwarding: False Dependencies
High-Byte Store-Buffer Data-Forwarding Restriction
One Supported Store-to-Load Forwarding Case
Summary of Store-to-Load-Forwarding Pitfalls to Avoid
Prefetch Instructions
Prefetching versus Preloading
PREFETCH/W versus PREFETCHNTA/T0/T1/T2
Unit-Stride Access
Hardware Prefetching
PREFETCHW versus PREFETCH
Write-Combining Usage
Multiple Prefetches
Determining Prefetch Distance
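A sketch of a loop that prefetches a fixed distance ahead of the element being processed; the distance of 64 doubles (eight 64-byte cache lines) is an illustrative tuning parameter, not a value taken from the guide, and __builtin_prefetch is a GCC/Clang extension rather than a portable call.

    double sum_array(const double *a, long n) {
        const long dist = 64;   /* prefetch distance in elements; tune per loop */
        double total = 0.0;
        for (long i = 0; i < n; i++) {
            if (i + dist < n)
                __builtin_prefetch(&a[i + dist]);   /* fetch a future cache line */
            total += a[i];
        }
        return total;
    }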
Memory-Limited Code
Processor-Limited Code
Definitions
Prefetch at Least 64 Bytes Away from Surrounding Stores
Streaming-Store/Non-Temporal Instructions
Write-Combining
How to Know If a Bank Conflict Exists
L1 Data Cache Bank Conflicts
Fields Used to Address the Multibank L1 Data Cache
Placing Code and Data in the Same 64-Byte Cache Line
Sorting and Padding C and C++ Structures
Memory Copy
Copying Small Data Structures
Optimized Stack Usage
Stack Considerations
Extend Arguments to 32 Bits Before Pushing onto Stack
Cache Issues When Writing Instruction Bytes to Memory
Interleave Loads and Stores
Branch Optimizations
Density of Branches
Two-Byte Near-Return RET Instruction
Unsigned Integer min Function: z = (x < y) ? x : y
Branches That Depend on Random Data
Signed Integer ABS Function: x = labs(x)
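Portable C sketches of the two branchless idioms named above; the assembly in the guide uses SUB/SBB/AND for the unsigned min and CDQ/XOR/SUB for the absolute value, and these functions express the same computations.

    #include <stdint.h>

    /* Branchless unsigned minimum: the mask is all ones when x < y. */
    uint32_t umin(uint32_t x, uint32_t y) {
        return y + ((x - y) & -(uint32_t)(x < y));
    }

    /* Branchless absolute value: smear the sign bit into a mask, then
     * conditionally negate.  Assumes arithmetic right shift of signed
     * values; INT32_MIN has no positive counterpart. */
    int32_t iabs32(int32_t x) {
        int32_t mask = x >> 31;        /* 0 for x >= 0, -1 for x < 0 */
        return (x ^ mask) - mask;
    }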
Conditional Write
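A hedged C sketch of the conditional-write idea: rather than branching around a store, the store always executes but is steered to a harmless dummy location when the condition is false, so the compiler can use a conditional move instead of a branch. The names are invented for the example.

    /* Always store; select the destination address, not the control flow. */
    void conditional_store(int *dest, int cond, int value) {
        static int dummy;                      /* scratch target, never read */
        int *target = cond ? dest : &dummy;    /* candidate for CMOV */
        *target = value;
    }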
Pairing Call and Return
Recursive Functions
Nonzero Code-Segment Base Values
Replacing Branches with Computation
Muxing Constructs
SSE Solution Preferred
MMX Solution Avoid
Sample Code Translated into AMD64 Code
Example 1: C Code
Example 1: 3DNow! Code
Example 2: C Code
Example 3: C Code
Example 3: 3DNow! Code
Example 4: C Code
Example 4: 3DNow! Code
Example 5: C Code
Example 5: 3DNow! Code
The LOOP Instruction
Far Control-Transfer Instructions
Chapter Scheduling Optimizations
Instruction Scheduling by Latency
Loop Unrolling
Partial Loop Unrolling
Complete Loop Unrolling
Example: Complete Loop Unrolling
Example: Partial Loop Unrolling
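A sketch of partial unrolling by a factor of four, with a cleanup loop for the leftover iterations; the loop body and names are illustrative.

    void scaled_add(float *y, const float *x, float a, int n) {
        int i;
        for (i = 0; i <= n - 4; i += 4) {   /* main loop, unrolled by 4 */
            y[i]     += a * x[i];
            y[i + 1] += a * x[i + 1];
            y[i + 2] += a * x[i + 2];
            y[i + 3] += a * x[i + 3];
        }
        for (; i < n; i++)                  /* remainder when n % 4 != 0 */
            y[i] += a * x[i];
    }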
Deriving the Loop Control for Partially Unrolled Loops
Inline Functions
Additional Recommendations
Address-Generation Interlocks
MOVZX and MOVSX
Pointer Arithmetic in Loops
Pushing Memory Data Directly onto the Stack
Chapter Integer Optimizations
Signed Division Utility
Replacing Division with Multiplication
Multiplication by Reciprocal Division Utility
Unsigned Division Utility
Algorithm: Divisors 2^31 <= d < 2^32
Unsigned Division by Multiplication of Constant
Algorithm: Divisors 1 <= d < 2^31, Odd d
Algorithm: Divisors 2 <= d < 2^31
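As a concrete instance of the unsigned multiply-by-reciprocal technique covered above, the function below divides a 32-bit unsigned integer by the constant 10 using a 64-bit multiply and a shift; the magic constant 0xCCCCCCCD with a shift of 35 is the standard pair for this divisor, the same kind of result the division utilities produce.

    #include <stdint.h>

    /* x / 10 for any 32-bit unsigned x, computed as a multiply-high plus shift. */
    uint32_t div_by_10(uint32_t x) {
        return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
    }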
Signed Division by Multiplication of Constant
Simpler Code for Restricted Dividend
Signed Division by
Remainder of Signed Division by 2 or -2
Signed Division by 2^n
Signed Division by -2^n
Remainder of Signed Division by 2^n or -2^n
Alternative Code for Multiplying by a Constant
Guidelines for Repeated String Instructions
Repeated String Instructions
Latency of Repeated String Instructions
Use the Largest Possible Operand Size
Using XOR to Clear Integer Registers
Acceptable
Efficient 64-Bit Integer Arithmetic in 32-Bit Mode
64-Bit Subtraction
64-Bit Addition
64-Bit Negation
64-Bit Unsigned Division
64-Bit Right Shift
64-Bit Multiplication
64-Bit Signed Division
64-Bit Unsigned Remainder Computation
64-Bit Signed Remainder Computation
Integer Version
Efficient Binary-to-ASCII Decimal Conversion
Binary-to-ASCII Decimal Conversion Retaining Leading Zeros
Binary-to-ASCII Decimal Conversion Suppressing Leading Zeros
Unsigned Integer Division
Signed Integer Division
Example Code
Optimizing Integer Division
Optimizing with SIMD Instructions
Ensure All Packed Floating-Point Data are Aligned
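A small sketch with SSE intrinsics: when the buffers are 16-byte aligned, the aligned load and store forms (MOVAPS and friends) can be used. The function names are invented, and callers are assumed to provide aligned pointers, for example from aligned_alloc or _mm_malloc.

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Multiply four packed floats by k.  'in' and 'out' must be 16-byte
     * aligned so _mm_load_ps/_mm_store_ps (MOVAPS-class accesses) are legal. */
    void scale4(float *out, const float *in, float k) {
        __m128 v = _mm_load_ps(in);
        __m128 r = _mm_mul_ps(v, _mm_set1_ps(k));
        _mm_store_ps(out, r);
    }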
Rationale: Single Precision
Rationale: Double Precision
Use MOVLPx/MOVHPx Instructions for Unaligned Data Access
Use MOVAPD and MOVAPS Instead of MOVUPD and MOVUPS
Double-Precision 32 × 32 Matrix Multiplication
Passing Data between MMX and 3DNow! Instructions
Storing Floating-Point Data in MMX Registers
EMMS and FEMMS Usage
Single Precision
Double Precision
Clearing MMX and XMM Registers with XOR Instructions
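In compiler intrinsics, the XOR-zeroing idiom is reached through the setzero intrinsics, which compilers generally emit as PXOR or XORPS of a register with itself; a minimal sketch:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    __m128i zero_ints(void)   { return _mm_setzero_si128(); }  /* PXOR  xmm, xmm */
    __m128  zero_floats(void) { return _mm_setzero_ps(); }     /* XORPS xmm, xmm */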
Code Below Puts the Floating-Point Sign Mask
PFPNACC
Listing 27. 4 × 4 Matrix Multiplication (SSE)
Listing 28. 4 × 4 Matrix Multiplication (3DNow! Technology)
x87 Floating-Point Optimizations
Using Multiplication Rather Than Division
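A sketch of the common case: many values divided by the same loop-invariant divisor. Computing the reciprocal once trades n divisions for one division and n multiplications; the result can differ from true division in the last bit, so this assumes the application tolerates that rounding difference.

    void divide_all(double *a, int n, double divisor) {
        double recip = 1.0 / divisor;   /* one division outside the loop */
        for (int i = 0; i < n; i++)
            a[i] *= recip;              /* multiplications inside the loop */
    }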
Achieving Two Floating-Point Operations per Clock Cycle
Align and Pack DirectPath x87 Instructions
Floating-Point Compare Instructions
Using the FXCH Instruction Rather Than FST/FLD Pairs
Floating-Point Subexpression Elimination
Accumulating Precision-Sensitive Quantities in x87 Registers
Avoiding Extended-Precision Data
Key Microarchitecture Features
Processor Block Diagram
Superscalar Processor
AMD Athlon 64 and AMD Opteron Processors Block Diagram
L1 Instruction Cache
L1 Instruction Cache Specifications by Processor
Branch-Prediction Table
Instruction Control Unit
L1 Instruction TLB Specifications
Fetch-Decode Unit
Translation-Lookaside Buffer
L1 Data TLB Specifications
L2 TLB Specifications
L1 Data Cache
L1 Data Cache Specifications by Processor
Integer Scheduler
Integer Execution Unit
Floating-Point Scheduler
Floating-Point Execution Unit
Load-Store Unit
Floating-Point Unit
L2 Cache
Buses for AMD Athlon 64 and AMD Opteron Processor
Integrated Memory Controller
HyperTransport Technology Interface
HyperTransport Technology
Write-Combining Definitions and Abbreviations
Programming Details
Write-Combining Operations
Write-Combining Completion Events
Sending Write-Buffer Data to the System
Optimizations
Appendix C Instruction Latencies
Parts of the Instruction Entry
Understanding Instruction Entries
Example Instruction Entry
Interpreting Placeholders
Interpreting Latencies
Integer Instructions
AAA
ADD reg16/32/64, mem16/32/64
BSWAP EAX/RAX/R8
CMOVNP/CMOVPO reg16/32/64, reg16/32/64
CMP reg16/32/64, mreg16/32/64
JA/JNBE disp16/32
LAHF
MFENCE
NOP (XCHG EAX, EAX)
RDPMC
RDMSR
RDTSC
SAHF
SBB reg16/32/64, mem16/32/64
STD
STC
STI
SYSENTER
SYSCALL
SYSEXIT
XCHG AX/EAX/RAX, SI/ESI/RSI/R14
XCHG AX/EAX/RAX, DI/EDI/RDI/R15
MMX Technology Instructions
FMUL
x87 Floating-Point Instructions
FADD FCOS
FADD FCOMPP
FDECSTP
FINCSTP
FADD/FMUL/FSTORE FINIT
FXTRACT
FXCH
FYL2X
3DNow! Technology Instructions
FEMMS
3DNow! Technology Extensions
SSE Instructions
PREFETCHNTA mem8 0Fh 18h mm-000-xxx DirectPath
SFENCE
SSE2 Instructions
MASKMOVDQU
FMUL PUNPCKHBW
FMUL PUNPCKHDQ
SSE3 Instructions
Fast-Write Optimizations
Fast-Write Optimizations for Graphics-Engine Programming
Cacheable-Memory Command Structure
Fast-Write Optimizations for Video-Memory Copies
Memory Optimizations
Northbridge Command Flow
Optimizations for Texture-Map Copies to AGP Memory
Optimizations for Vertex-Geometry Copies to AGP Memory
Types of XMM-Register Data
Types of SSE and SSE2 Instructions
Half-Register Operations
Zeroing Out an XMM Register
Clearing XMM Registers
Reuse of Dead Registers
Moving Data Between XMM Registers and GPRs
Saving and Restoring Registers of Unknown Format
SSE and SSE2 Copy Loops
Explicit Load Instructions
Data Conversion
Converting Scalar Values
Converting Vector Values
Converting Directly from Memory
Index
Numerics