
25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Chapter 9 Optimizing with SIMD Instructions

The 64-bit and 128-bit SIMD instructions (the SSE and SSE2 instructions) should be used to encode floating-point and integer operations.

• The SIMD instructions use a flat register file rather than the stack register file used by x87 floating-point instructions. This allows arbitrary sequences of operations to map more efficiently to the instruction set.

• Future processors with more or wider multipliers and adders will achieve better throughput using SSE and SSE2 instructions. (Today’s processors implement a 128-bit-wide SSE or SSE2 operation as two 64-bit operations that are internally pipelined.)

• SSE and SSE2 instructions work well in both 32-bit and 64-bit threads.
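
As an illustration of the flat register file described above, the following is a minimal sketch in C using the SSE intrinsics from `<xmmintrin.h>`. The function name `mul_add4` and its parameters are hypothetical and do not come from this guide. Each intermediate value lives in its own named XMM register, so the multiply and the add can be written in any order without the stack juggling that x87 code requires. The sketch uses unaligned loads and stores so that it runs on any input; aligned data is generally preferable where the caller can guarantee it.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, _mm_add_ps */

/* Hypothetical helper: out[i] = a[i] * b[i] + c[i] for i = 0..3.
 * One packed multiply and one packed add process four single-precision
 * values at once. */
void mul_add4(const float *a, const float *b, const float *c, float *out)
{
    __m128 va   = _mm_loadu_ps(a);       /* load 4 floats, no alignment needed */
    __m128 vb   = _mm_loadu_ps(b);
    __m128 vc   = _mm_loadu_ps(c);
    __m128 prod = _mm_mul_ps(va, vb);    /* 4 multiplications */
    __m128 sum  = _mm_add_ps(prod, vc);  /* 4 additions */
    _mm_storeu_ps(out, sum);             /* store 4 results */
}
```

A caller would pass four-element `float` arrays for `a`, `b`, `c`, and `out`. A compiler targeting AMD64 would typically map each `__m128` variable to an XMM register.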

The SIMD instructions provide a theoretical single-precision peak throughput of two additions and two multiplications per clock cycle, whereas x87 instructions can only sustain one addition and one multiplication per clock cycle. The SSE2 and x87 double-precision peak throughput is the same, but SSE2 instructions provide better code density.
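
The per-cycle figures above can be turned into a rough peak rate by multiplying by the clock frequency. The sketch below does that arithmetic in C. The function name `peak_gflops` and the 2.0 GHz clock used in the usage note are assumptions chosen for illustration, not values taken from this guide.

```c
/* Hypothetical helper: theoretical peak rate in GFLOPS, computed as
 * (additions per cycle + multiplications per cycle) * clock in GHz. */
double peak_gflops(int adds_per_cycle, int muls_per_cycle, double clock_ghz)
{
    return (double)(adds_per_cycle + muls_per_cycle) * clock_ghz;
}
```

With an assumed 2.0 GHz clock, `peak_gflops(2, 2, 2.0)` gives 8.0 for single-precision SIMD code and `peak_gflops(1, 1, 2.0)` gives 4.0 for x87 code. These are theoretical ceilings only. Real code is usually limited by memory bandwidth, dependencies, and instruction mix.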

This chapter covers the following topics:

Topic	Page

Ensure All Packed Floating-Point Data are Aligned	195

Improving Scalar SSE and SSE2 Floating-Point Performance with MOVLPD and MOVLPS When Loading Data from Memory	196

Structuring Code with Prefetch Instructions to Hide Memory Latency	200

Avoid Moving Data Directly Between General-Purpose and MMX™ Registers	206

Use MMX™ Instructions to Construct Fast Block-Copy Routines in 32-Bit Mode	207

Passing Data between MMX™ and 3DNow!™ Instructions	208

Storing Floating-Point Data in MMX™ Registers	209

EMMS and FEMMS Usage	210

Using SIMD Instructions for Fast Square Roots and Fast Reciprocal Square Roots	211

Use XOR Operations to Negate Operands of SSE, SSE2, and 3DNow!™ Instructions	215

Clearing MMX™ and XMM Registers with XOR Instructions	216

Finding the Floating-Point Absolute Value of Operands of SSE, SSE2, and 3DNow!™ Instructions	217

Accumulating Single-Precision Floating-Point Numbers Using SSE, SSE2, and 3DNow!™ Instructions	218

