218 Optimizing with SIMD Instructions Chapter 9
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
9.15 Accumulating Single-Precision Floating-Point Numbers Using SSE, SSE2, and 3DNow!™ Instructions
Optimization
In 32-bit software, use the 3DNow! PFACC instruction to perform complex-number multiplication,
4×4 matrix multiplication, and dot products. For 64-bit software, careful selection of SSE
instructions based on how the data is organized can also lead to more efficient code, as shown in the
second example.
Application
This optimization applies to:
32-bit software
64-bit software
Rationale
Though SSE, SSE2, and 3DNow! instructions are similar in the sense that they all have vectorized
multiplication and addition, 3DNow! technology supports certain special instructions. One of these is
the PFACC instruction, which adds the two single-precision values held within each of its operands.
There are many instances where PFACC is useful, such as complex-number multiplication, 4×4 matrix
multiplication, and dot products.
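To make the accumulate operation concrete, the following is a minimal scalar sketch in C of what PFACC computes and how it can finish a four-element dot product. It is a model for illustration only, not 3DNow! code: the type pair_f32 and the functions pfacc_model, pfmul_model, and dot4_model are hypothetical names introduced here, and the sketch assumes the documented PFACC behavior (low result = sum of the two destination elements, high result = sum of the two source elements).

```c
/* Hypothetical scalar model of a 64-bit MMX register holding two floats. */
typedef struct {
    float lo;   /* bits 31:0  */
    float hi;   /* bits 63:32 */
} pair_f32;

/* Models PFACC dst, src:
 *   result.lo = dst.lo + dst.hi
 *   result.hi = src.lo + src.hi                                   */
static pair_f32 pfacc_model(pair_f32 dst, pair_f32 src)
{
    pair_f32 r;
    r.lo = dst.lo + dst.hi;
    r.hi = src.lo + src.hi;
    return r;
}

/* Models PFMUL dst, src: element-wise multiplication. */
static pair_f32 pfmul_model(pair_f32 dst, pair_f32 src)
{
    pair_f32 r;
    r.lo = dst.lo * src.lo;
    r.hi = dst.hi * src.hi;
    return r;
}

/* Four-element dot product: two multiplies, then two accumulates. */
static float dot4_model(const float *x, const float *y)
{
    pair_f32 x01, x23, y01, y23, p01, p23, s;

    x01.lo = x[0]; x01.hi = x[1];
    x23.lo = x[2]; x23.hi = x[3];
    y01.lo = y[0]; y01.hi = y[1];
    y23.lo = y[2]; y23.hi = y[3];

    p01 = pfmul_model(x01, y01);   /* [x0*y0, x1*y1] */
    p23 = pfmul_model(x23, y23);   /* [x2*y2, x3*y3] */

    s = pfacc_model(p01, p23);     /* [x0*y0 + x1*y1, x2*y2 + x3*y3] */
    s = pfacc_model(s, s);         /* low element now holds the full sum */
    return s.lo;
}
```

In this model the horizontal sum costs one accumulate per pair of registers, which is why PFACC suits dot products and the row-by-column sums of a 4×4 matrix multiplication.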
Examples
The following example sums the pair of floats held in each of two MMX registers (a + b and c + d):
;accumulate_3dnow(float *a_and_b, float *c_and_d, float *aplusb_cplusd);
;
; TO ASSEMBLE INTO *.obj DO THE FOLLOWING:
; ml.exe -coff -c accumulate_3dnow.asm
;
.586
.K3D
.XMM
_TEXT SEGMENT
PUBLIC _accumulate_3dnow
_accumulate_3dnow PROC NEAR
;==============================================================================
; INSTRUCTIONS BELOW SAVE THE REGISTER STATE WITH WHICH THIS ROUTINE WAS ENTERED
; REGISTERS (EAX, ECX, EDX ARE CONSIDERED VOLATILE AND ASSUMED TO BE CHANGED)
; WHILE THE REGISTERS BELOW MUST BE PRESERVED IF THE USER IS CHANGING THEM