Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

9.15 Accumulating Single-Precision Floating-Point Numbers Using SSE, SSE2, and 3DNow!™ Instructions

Optimization

In 32-bit software, use the 3DNow! PFACC instruction to perform complex-number multiplication, 4 × 4 matrix multiplication, and dot products. For 64-bit software, careful selection of SSE instructions based on how the data is organized can also lead to more efficient code, as shown in the second example.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Though SSE, SSE2, and 3DNow! instructions are similar in the sense that they all have vectorized multiplication and addition, 3DNow! technology supports certain special instructions. One of these is the PFACC instruction, which adds the two single-precision values within each operand and packs the two sums into the destination register. There are many instances where PFACC is useful, such as complex-number multiplication, 4 × 4 matrix multiplication, and dot products.
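The effect of PFACC can be illustrated with a scalar C model. This is a minimal sketch, not code from this guide: the v2f type and the pfmul_model, pfadd_model, pfacc_model, dot_model, and cmul_model helpers are hypothetical names introduced only for illustration, and each one imitates a single 3DNow! instruction (or a short sequence of them) operating on a pair of floats in one MMX register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of one 64-bit MMX register holding two
   packed single-precision values (lo = bits 31:0, hi = bits 63:32). */
typedef struct { float lo, hi; } v2f;

/* Models PFMUL dst, src: element-wise multiply. */
static v2f pfmul_model(v2f dst, v2f src) {
    v2f r = { dst.lo * src.lo, dst.hi * src.hi };
    return r;
}

/* Models PFADD dst, src: element-wise add. */
static v2f pfadd_model(v2f dst, v2f src) {
    v2f r = { dst.lo + src.lo, dst.hi + src.hi };
    return r;
}

/* Models PFACC dst, src: the low result is the sum of the two halves
   of dst, the high result is the sum of the two halves of src. */
static v2f pfacc_model(v2f dst, v2f src) {
    v2f r = { dst.lo + dst.hi, src.lo + src.hi };
    return r;
}

/* Dot product of two arrays with an even element count: multiply and
   add in pairs, then collapse the two partial sums with one PFACC. */
static float dot_model(const float *x, const float *y, size_t n) {
    assert(n % 2 == 0);
    v2f acc = { 0.0f, 0.0f };
    for (size_t i = 0; i < n; i += 2) {
        v2f a = { x[i], x[i + 1] };
        v2f b = { y[i], y[i + 1] };
        acc = pfadd_model(acc, pfmul_model(a, b));
    }
    return pfacc_model(acc, acc).lo;
}

/* Complex multiply (a + bi)(c + di) with z = [a, b] and w = [c, d].
   The operand shuffles are written in plain C here; in real code they
   would need their own instructions or pre-arranged data. */
static v2f cmul_model(v2f z, v2f w) {
    v2f w_neg_im  = { w.lo, -w.hi };            /* [c, -d]            */
    v2f w_swapped = { w.hi,  w.lo };            /* [d,  c]            */
    v2f re_terms  = pfmul_model(z, w_neg_im);   /* [ac, -bd]          */
    v2f im_terms  = pfmul_model(z, w_swapped);  /* [ad,  bc]          */
    return pfacc_model(re_terms, im_terms);     /* [ac - bd, ad + bc] */
}
```

In this model a single PFACC replaces the shuffle-and-add sequence that a purely vertical instruction set would need to sum values within one register, which is the reason the instruction suits dot products and complex arithmetic.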

Examples

The following example accumulates the two floats held in each of two MMX registers:

;accumulate_3dnow(float *a_and_b, float *c_and_d, float *aplusb_cplusd);
;
;TO ASSEMBLE INTO *.obj DO THE FOLLOWING:
;ml.exe -coff -c accumulate_3dnow.asm
.586
.K3D
.XMM
_TEXT SEGMENT
PUBLIC _accumulate_3dnow
_accumulate_3dnow PROC NEAR
;==============================================================================
;INSTRUCTIONS BELOW SAVE THE REGISTER STATE WITH WHICH THIS ROUTINE WAS ENTERED
;REGISTERS (EAX, ECX, EDX ARE CONSIDERED VOLATILE AND ASSUMED TO BE CHANGED)
;WHILE THE REGISTERS BELOW MUST BE PRESERVED IF THE USER IS CHANGING THEM

Optimizing with SIMD Instructions

Chapter 9
