Software Optimization Guide For AMD64 Processors
2001 2005 Advanced Micro Devices, Inc. All rights reserved
Contents
General 64-Bit Optimizations
Chapter Cache and Memory Optimizations
Chapter Integer Optimizations 159
Chapter X87 Floating-Point Optimizations 237
Viii
Appendix B Implementation of Write-Combining
Index
Software Optimization Guide for AMD64 Processors
Tables
Tables
Xii Tables
Figures
Xiv
Revision History
Xvi Revision History
Getting Started Quickly
Intended Audience
Using This Guide
Providing Feedback
Special Information
Numbering Systems
Typographic Notation
Primitive Operations
Important New Terms
Internal Instruction Formats
Instructions, Macro-ops and Micro-ops
Types of Instructions
Mrom
Optimizations by Rank
Key Optimizations
Guideline
Key Optimizations by Rank
C++ Source-Level Optimizations
C++ Source-Level Optimizations
Rationale
Declarations of Floating-Point Values
Optimization
Application
Using Arrays and Pointers
Matrix
Example
Instead, use the equivalent array notation
Additional Considerations
Related Information
Unrolling Small Loops
Expression Order in Compound Branch Conditions
Chapter C++ Source-Level Optimizations
Long Logical Expressions in If Statements
Arrange Boolean Operands for Quick Expression Evaluation
If *p == y && strlenp
Dynamic Memory Allocation Consideration
Listing 3. Avoid
Unnecessary Store-to-Load Dependencies
Application
Examples
Matching Store and Load Size
Listing 6. Preferred
C++ Source-Level Optimizations
Switch and Noncontiguous Case Expressions
Example
Related Information
Arranging Cases by Probability of Occurrence
Use of Function Prototypes
Use of const Type Qualifier
Rationale and Examples
Generic Loop Hoisting
Listing
Chapter C++ Source-Level Optimizations
Local Static Functions
Explicit Parallelism in Code
Listing 11. Preferred
Extracting Common Subexpressions
Listing 15. Example 2 Preferred
Sorting and Padding C and C++ Structures
Sorting and Padding C and C++ Structures
C++ Source-Level Optimizations
Sorting Local Variables
Related Information
Replacing Integer Division with Multiplication
Listing 16. Avoid
Frequently Dereferenced Pointer Arguments
Listing 17. Preferred
Array Indices
Rational
23 32-Bit Integral Data Types
Sign of Integer Operands
Listing 20. Example 2 Avoid
Accelerating Floating-Point Division and Square Root
Examples
Fast Floating-Point-to-Integer Conversion
Listing 23. Slow
Branches Dependent on Integer Comparisons Are Fast
Speeding Up Branches Based on Comparisons Between Floats
Comparisons against Positive Constant
Comparisons among Two Floats
Improving Performance in Linux Libraries
Software Optimization Guide for AMD64 Processors
General 64-Bit Optimizations
64-Bit Registers and Integer Arithmetic
This code performs 64-bit addition using 32-bit registers
ESP+8ESP+4 = multiplicand
Background
64-Bit Arithmetic and Large-Integer Multiplication
G1 = c3 E1 + f1 + g0 = c2 D1 + e0 + f0 = c1 D0 = c0
XMM
END
Text Segment
128-Bit Media Instructions and Floating-Point Operations
32-Bit Legacy GPRs and Small Unsigned Integers
Chapter General 64-Bit Optimizations
General 64-Bit Optimizations
Instruction-Decoding Optimizations
DirectPath Instructions
Load-Execute Integer Instructions Optimization
Load-Execute Instructions
Movss xmm0, floatvar1 mulss xmm0, floatvar2
Application
Branch Targets in Program Hot Spots
32/64-Bit vs -Bit Forms of the LEA Instruction
Take Advantage of x86 and AMD64 Complex Addressing Modes
Cmpb %al,0x68e35%r10,%r13
Short Instruction Encodings
Partial-Register Reads and Writes
Avoid
Functions That Allocate Local Variables
Using Leave for Function Epilogues
Functions That Do not Allocate Local Variables
Traditional function epilogue looks like this
Alternatives to Shld Instruction
Lea reg1, reg1*8+reg2
10 8-Bit Sign-Extended Immediate Values
11 8-Bit Sign-Extended Displacements
NOP
Code Padding with Operand-Size Override
Instruction-Decoding Optimizations
Cache and Memory Optimizations
Avoid
Memory-Size Mismatches
Examples-Store-to-Load-Forwarding Stalls
Bit Avoid
Examples-Large-to-small Mismatches
Preferred If the Contents of MM0 are No Longer Needed
Preferred If Stores Are Close to the Load
Preferred If the Stores and Loads are Close Together, Option
Natural Alignment of Data Objects
Cache-Coherent Nonuniform Memory Access ccNUMA
CPU0
OS Implications
Dual-Core AMD Opteron Processor Configuration
Multiprocessor Considerations
100
Store-to-Load Forwarding Restrictions
Store-to-Load Forwarding Pitfalls-True Dependencies
Narrow-to-Wide Store-Buffer Data-Forwarding Restriction
Misaligned Store-Buffer Data-Forwarding Restriction
Wide-to-Narrow Store-Buffer Data-Forwarding Restriction
101
102
High-Byte Store-Buffer Data-Forwarding Restriction
One Supported Store-to-Load Forwarding Case
Store-to-Load Forwarding-False Dependencies
103
Summary of Store-to-Load-Forwarding Pitfalls to Avoid
Prefetching versus Preloading
Prefetch Instructions
104
105
Unit-Stride Access
Hardware Prefetching
PREFETCH/W versus PREFETCHNTA/T0/T1/T2
Write-Combining Usage
Prefetchw versus Prefetch
106
107
Multiple Prefetches
108
Determining Prefetch Distance
Processor-Limited Code
Memory-Limited Code
Cache and Memory Optimizations
Definitions
111
Prefetch at Least 64 Bytes Away from Surrounding Stores
Streaming-Store/Non-Temporal Instructions
113
Write-combining
L1 Data Cache Bank Conflicts
How to Know If a Bank Conflict Exists
Fields Used to Address the Multibank L1 Data Cache
115
Placing Code and Data in the Same 64-Byte Cache Line
117
Sorting and Padding C and C++ Structures
118
119
Copying Small Data Structures
Memory Copy
120
121
122
Stack Considerations
Extend Arguments to 32 Bits Before Pushing onto Stack
Optimized Stack Usage
123
Cache Issues when Writing Instruction Bytes to Memory
124
Interleave Loads and Stores
125
This Chapter
126
Density of Branches
Align
127
128
Two-Byte Near-Return RET Instruction
129
130
Branches That Depend on Random Data
Signed Integer ABS Function x = labsx
Unsigned Integer min Function z = x y ? x y
131
Conditional Write
132
Pairing Call and Return
133
Recursive Functions
134
135
Nonzero Code-Segment Base Values
Muxing Constructs
Replacing Branches with Computation
136
MMX Solution Avoid
SSE Solution Preferred
137
Example 2 C Code
Sample Code Translated into AMD64 Code
Example 1 C Code
Example 1 3DNow! Code
Example 4 3DNow! Code
Example 3 C Code
Example 3 3DNow! Code
Example 4 C Code
Example 5 3DNow! Code
Example 5 C Code
140
141
Loop Instruction
142
Far Control-Transfer Instructions
143
Chapter Scheduling Optimizations
144
Instruction Scheduling by Latency
Loop Unrolling
Loop Unrolling
145
Example Partial Loop Unrolling
Complete Loop Unrolling
Example Complete Loop Unrolling
Partial Loop Unrolling
147
Fadd
148
Deriving the Loop Control for Partially Unrolled Loops
149
Inline Functions
150
Additional Recommendations
Address-Generation Interlocks
Address-Generation Interlocks
151
152
153
Movzx and Movsx
154
Pointer Arithmetic in Loops
155
156
157
Pushing Memory Data Directly onto the Stack
158
159
Chapter Integer Optimizations
Unsigned Division Utility
Replacing Division with Multiplication
Multiplication by Reciprocal Division Utility
Signed Division Utility
161
Unsigned Division by Multiplication of Constant
Algorithm Divisors 1 = d 231, Odd d
Algorithm Divisors 231 = d
Signed Division by
Signed Division by Multiplication of Constant
Simpler Code for Restricted Dividend
Algorithm Divisors 2 = d
Remainder of Signed Division by 2n or -2n
Signed Division by 2n
Signed Division by -2n
Remainder of Signed Division by 2 or
164
Alternative Code for Multiplying by a Constant
165
166
167
Repeated String Instructions
Latency of Repeated String Instructions
Guidelines for Repeated String Instructions
168
Use the Largest Possible Operand Size
Acceptable
Using XOR to Clear Integer Registers
169
Bit Negation
Efficient 64-Bit Integer Arithmetic in 32-Bit Mode
Bit Addition
Bit Subtraction
171
Bit Right Shift
Bit Multiplication
Bit Unsigned Division
172
173
Bit Signed Division
174
Bit Unsigned Remainder Computation
175
176
Bit Signed Remainder Computation
177
178
179
180
Integer Version
Binary-to-ASCII Decimal Conversion Retaining Leading Zeros
Efficient Binary-to-ASCII Decimal Conversion
181
182
183
Binary-to-ASCII Decimal Conversion Suppressing Leading Zeros
184
185
186
Unsigned Integer Division
187
188
Example Code
Signed Integer Division
189
190
191
192
Optimizing Integer Division
193
Optimizing with Simd Instructions
194
195
Ensure All Packed Floating-Point Data are Aligned
196
Rationale-Single Precision
197
Rational-Double Precision
198
Use MOVLPx/MOVHPx Instructions for Unaligned Data Access
199
Use Movapd and Movaps Instead of Movupd and Movups
Loop type Description
200
201
202
Double-Precision 32 ⋅ 32 Matrix Multiplication
203
204
205
206
207
208
Passing Data between MMX and 3DNow! Instructions
209
Storing Floating-Point Data in MMX Registers
210
Emms and Femms Usage
XMM Text Segment
211
212
213
214
Double Precision
Single Precision
215
216
Clearing MMX and XMM Registers with XOR Instructions
217
218
219
220
221
Code below Puts the Floating Point Sign Mask
222
223
224
225
226
Pfpnacc
227
228
229
230
Listing 27 ⋅ 4 Matrix Multiplication SSE
231
XMM3
232
Listing 28 ⋅ 4 Matrix Multiplication 3DNow! Technology
233
234
235
236
237
X87 Floating-Point Optimizations
238
Using Multiplication Rather Than Division
239
Achieving Two Floating-Point Operations per Clock Cycle
240
241
242
Align and Pack DirectPath x87 Instructions
243
244
Floating-Point Compare Instructions
245
Using the Fxch Instruction Rather Than FST/FLD Pairs
246
Floating-Point Subexpression Elimination
247
Accumulating Precision-Sensitive Quantities in x87 Registers
248
Avoiding Extended-Precision Data
249
Key Microarchitecture Features
Superscalar Processor
Processor Block Diagram
251
L1 Instruction Cache
AMD Athlon 64 and AMD Opteron Processors Block Diagram
Branch-Prediction Table
L1 Instruction Cache Specifications by Processor
253
Translation-Lookaside Buffer
L1 Instruction TLB Specifications
Fetch-Decode Unit
Instruction Control Unit
L2 TLB Specifications
L1 Data TLB Specifications
10 L1 Data Cache
Integer Scheduler
L1 Data Cache Specifications by Processor
Integer Execution Unit
257
Floating-Point Scheduler
Load-Store Unit
Floating-Point Execution Unit
Floating-Point Unit
259
16 L2 Cache
Integrated Memory Controller
Buses for AMD Athlon 64 and AMD Opteron Processor
HyperTransport Technology Interface
261
HyperTransport Technology
Software Optimization Guide for AMD64 Processors
263
Write-Combining Definitions and Abbreviations
Write-combining Operations
Programming Details
264
265
Write-Combining Completion Events
266
Sending Write-Buffer Data to the System
267
Optimizations
268
269
Appendix C Instruction Latencies
270
Understanding Instruction Entries
Example Instruction Entry
Parts of the Instruction Entry
271
Interpreting Placeholders
272
Interpreting Latencies
AAA
Integer Instructions
Integer Instructions
273
ADD reg16/32/64, mem16/32/64
274
Bswap EAX/RAX/R8
275
276
277
Mem16/32/64 CMOVNP/CMOVPO reg16/32/64, reg16/32/64
278
CMP reg16/32/64, mreg16/32/64
279
280
281
282
283
JA/JNBE disp16/32
284
Lahf
285
286
Mfence
287
288
NOP Xchg EAX, EAX
289
290
291
292
Rdtsc
293
Rdmsr
Rdpmc
Sahf
294
SBB reg16/32/64, mem16/32/64
295
296
297
298
STI
299
STC
STD
Sysexit
300
Syscall
Sysenter
301
Xchg AX/EAX/RAX, SI/ESI/RSI/R14
302
Xchg AX/EAX/RAX, DI/EDI/RDI/R15
MMX Technology Instructions
MMX Technology Instructions
303
Fmul
304
305
306
X87 Floating-Point Instructions
X87 Floating-Point Instructions
307
Fdecstp
308
Fadd Fcompp
Fadd Fcos
Fincstp
309
FADD/FMUL Fstore Finit
310
311
312
FYL2X
313
Fxch
Fxtract
Femms
3DNow! Technology Instructions
3DNow! Technology Instructions
314
315
DNow! Technology Instructions
3DNow! Technology Extensions
3DNow! Technology Extensions
316
SSE Instructions
SSE Instructions
317
318
319
320
321
Mem64 Prefetchnta mem8 0Fh 18h Mm-000-xxx DirectPath
322
Sfence
323
324
325
SSE2 Instructions
SSE2 Instructions
326
0FH
327
Fmul Fstore
328
Maskmovdqu
329
Fadd Fmul Fstore
330
331
332
Fadd Fmul
333
334
335
336
337
338
Fmul Punpckhbw
339
Fmul Punpckhdq
340
341
SSE3 Instructions
SSE3 Instructions
342
343
344
345
Fast-Write Optimizations
346
Fast-Write Optimizations for Graphics-Engine Programming
347
Cacheable-Memory Command Structure
348
349
Fast-Write Optimizations for Video-Memory Copies
350
351
Memory Optimizations
352
Northbridge Command Flow
Optimizations for Vertex-Geometry Copies to AGP Memory
Optimizations for Texture-Map Copies to AGP Memory
353
354
Types of SSE and SSE2 Instructions
Types of XMM-Register Data
355
356
Half-Register Operations
Clearing XMM Registers
Zeroing Out an XMM Register
357
358
359
Reuse of Dead Registers
360
Moving Data Between XMM Registers and GPRs
361
Saving and Restoring Registers of Unknown Format
362
SSE and SSE2 Copy Loops
363
Explicit Load Instructions
Converting Scalar Values
Data Conversion
364
INT GPR FPS
Converting Vector Values
Converting Directly from Memory
365
366
367
Numerics
368 Index