Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
Chapter 9 Optimizing with SIMD Instructions
The 64-bit and 128-bit SIMD instructions (the SSE and SSE2 instructions) should be used to encode
floating-point and integer operations.
The SIMD instructions use a flat register file rather than the stack register file used by x87
floating-point instructions. This allows arbitrary sequences of operations to map more efficiently
to the instruction set.
Future processors with more or wider multipliers and adders will achieve better throughput using
SSE and SSE2 instructions. (Today’s processors implement a 128-bit-wide SSE or SSE2
operation as two 64-bit operations that are internally pipelined.)
SSE and SSE2 instructions work well in both 32-bit and 64-bit threads.
The SIMD instructions provide a theoretical single-precision peak throughput of two additions and
two multiplications per clock cycle, whereas x87 instructions can only sustain one addition and one
multiplication per clock cycle. The SSE2 and x87 double-precision peak throughput is the same, but
SSE2 instructions provide better code density.
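The packed operations described above can be sketched in C. This is a minimal illustration, not code from this guide: it assumes a compiler that exposes the SSE and SSE2 intrinsics through `<emmintrin.h>`, and it falls back to plain scalar loops where SSE2 is not available. The helper names `madd4_ps`, `add4_epi32`, and `run_demo` are hypothetical. Unaligned loads and stores are used so that the sketch does not depend on how the caller allocated the arrays.

```c
#include <assert.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics; also provides the SSE (__m128) ones */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0      /* assumption: no SSE2, use the scalar fallback */
#endif

/* Hypothetical helper: out[i] = a[i] * b[i] + c[i] for four floats.
   One packed multiply and one packed add replace four scalar multiplies
   and four scalar adds. */
static void madd4_ps(const float *a, const float *b, const float *c, float *out)
{
#if HAVE_SSE2
    __m128 va = _mm_loadu_ps(a);          /* load four single-precision values */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_loadu_ps(c);
    __m128 prod = _mm_mul_ps(va, vb);     /* packed multiply (MULPS) */
    _mm_storeu_ps(out, _mm_add_ps(prod, vc)); /* packed add (ADDPS), then store */
#else
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] * b[i] + c[i];
#endif
}

/* Hypothetical helper: out[i] = a[i] + b[i] for four 32-bit integers. */
static void add4_epi32(const int *a, const int *b, int *out)
{
#if HAVE_SSE2
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb)); /* packed add (PADDD) */
#else
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}

/* Small self-check; the inputs are chosen so every result is exact in binary
   floating point. */
static void run_demo(void)
{
    const float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float b[4] = {0.5f, 0.5f, 0.5f, 0.5f};
    const float c[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    float fout[4];
    const int x[4] = {1, 2, 3, 4};
    const int y[4] = {10, 20, 30, 40};
    int iout[4];

    madd4_ps(a, b, c, fout);
    add4_epi32(x, y, iout);

    assert(fout[0] == 1.5f && fout[3] == 3.0f);
    assert(iout[0] == 11 && iout[3] == 44);
}
```

Calling `run_demo()` from a `main` function exercises both helpers. Because the intrinsics name their operands directly, the compiler is free to place each value in any XMM register, which is the flat-register-file property noted above.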
This chapter covers the following topics:
Ensure All Packed Floating-Point Data are Aligned
Improving Scalar SSE and SSE2 Floating-Point Performance with MOVLPD and MOVLPS When Loading Data from Memory
Structuring Code with Prefetch Instructions to Hide Memory Latency
Avoid Moving Data Directly Between General-Purpose and MMX™ Registers
Use MMX™ Instructions to Construct Fast Block-Copy Routines in 32-Bit Mode
Passing Data between MMX™ and 3DNow!™ Instructions
Storing Floating-Point Data in MMX™ Registers
EMMS and FEMMS Usage
Using SIMD Instructions for Fast Square Roots and Fast Reciprocal Square Roots
Use XOR Operations to Negate Operands of SSE, SSE2, and 3DNow!™ Instructions
Clearing MMX™ and XMM Registers with XOR Instructions
Finding the Floating-Point Absolute Value of Operands of SSE, SSE2, and 3DNow!™ Instructions
Accumulating Single-Precision Floating-Point Numbers Using SSE, SSE2, and 3DNow!™ Instructions