25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Chapter 9 Optimizing with SIMD Instructions

The 64-bit and 128-bit SIMD instructions (the SSE and SSE2 instructions) should be used to encode floating-point and integer operations.

The SIMD instructions use a flat register file rather than the stack register file used by x87 floating-point instructions. This allows arbitrary sequences of operations to map more efficiently to the instruction set.
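As a rough illustration of the flat register file, the following C sketch uses SSE compiler intrinsics (an assumption of this example; the guide itself works at the instruction level). The helper name sum_times_diff is hypothetical. Each intermediate value is an independent __m128 value that a compiler can keep in any XMM register, so the two sub-expressions do not have to be ordered around a stack top as they would with x87 code.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: computes (a + b) * (c - d) for four
   single-precision values at once. Unaligned loads and stores are
   used so the sketch works with any pointer; aligned data is
   generally preferable for packed operations. */
static void sum_times_diff(const float *a, const float *b,
                           const float *c, const float *d, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_loadu_ps(c);
    __m128 vd = _mm_loadu_ps(d);

    /* The two intermediates are independent named values; either one
       can be computed first and both stay directly addressable. */
    __m128 sum  = _mm_add_ps(va, vb);
    __m128 diff = _mm_sub_ps(vc, vd);

    _mm_storeu_ps(out, _mm_mul_ps(sum, diff));
}
```

The register allocation is left to the compiler here; hand-written assembly would name XMM0 through XMM7 (or XMM0 through XMM15 in 64-bit mode) directly.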

Future processors with more or wider multipliers and adders will achieve better throughput using SSE and SSE2 instructions. (Today’s processors implement a 128-bit-wide SSE or SSE2 operation as two 64-bit operations that are internally pipelined.)

SSE and SSE2 instructions work well in both 32-bit and 64-bit threads.

The SIMD instructions provide a theoretical single-precision peak throughput of two additions and two multiplications per clock cycle, whereas x87 instructions can only sustain one addition and one multiplication per clock cycle. The SSE2 and x87 double-precision peak throughput is the same, but SSE2 instructions provide better code density.
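To make the single-precision case concrete, here is a minimal C sketch using SSE intrinsics (again an assumption of this example, not code from the guide). The hypothetical function scale_add computes y[i] = a * x[i] + y[i], issuing one packed multiply and one packed add per four elements. Whether a loop like this approaches the peak rates quoted above depends on factors the sketch does not address, such as data alignment, memory bandwidth, and instruction scheduling.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: y[i] = a * x[i] + y[i] for n floats.
   Four elements are processed per iteration with packed
   single-precision instructions; leftover elements are handled
   by a scalar loop. */
static void scale_add(float a, const float *x, float *y, size_t n)
{
    __m128 va = _mm_set1_ps(a);   /* broadcast a to all four lanes */
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);  /* one MULPS, one ADDPS */
        _mm_storeu_ps(y + i, vy);
    }

    for (; i < n; ++i)            /* scalar tail for n not divisible by 4 */
        y[i] = a * x[i] + y[i];
}
```

The unaligned load and store intrinsics keep the sketch safe for arbitrary pointers; with data known to be 16-byte aligned, _mm_load_ps and _mm_store_ps could be used instead.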

This chapter covers the following topics:

Topic ..... Page

Ensure All Packed Floating-Point Data are Aligned ..... 195
Improving Scalar SSE and SSE2 Floating-Point Performance with MOVLPD and MOVLPS When Loading Data from Memory ..... 196
Structuring Code with Prefetch Instructions to Hide Memory Latency ..... 200
Avoid Moving Data Directly Between General-Purpose and MMX™ Registers ..... 206
Use MMX™ Instructions to Construct Fast Block-Copy Routines in 32-Bit Mode ..... 207
Passing Data between MMX™ and 3DNow!™ Instructions ..... 208
Storing Floating-Point Data in MMX™ Registers ..... 209
EMMS and FEMMS Usage ..... 210
Using SIMD Instructions for Fast Square Roots and Fast Reciprocal Square Roots ..... 211
Use XOR Operations to Negate Operands of SSE, SSE2, and 3DNow!™ Instructions ..... 215
Clearing MMX™ and XMM Registers with XOR Instructions ..... 216
Finding the Floating-Point Absolute Value of Operands of SSE, SSE2, and 3DNow!™ Instructions ..... 217
Accumulating Single-Precision Floating-Point Numbers Using SSE, SSE2, and 3DNow!™ Instructions ..... 218
