25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

push ebp
mov ebp, esp
;==============================================================================
;Parameters passed into routine:
;[ebp+8] = ->a_and_b
;[ebp+12] = ->c_and_d
;[ebp+16] = ->aplusb_cplusd
;==============================================================================
push ebx
push esi
push edi
;==============================================================================
;THE 4 ASM LINES BELOW LOAD THE FUNCTION'S ARGUMENTS INTO GENERAL-PURPOSE
;REGISTERS (GPRS)
;esi = starting address of 2 floats "a_and_b"
;edi = starting address of 2 floats "c_and_d"
;eax = starting address of 2 floats "aplusb_cplusd"
;==============================================================================
mov esi, [ebp+8]     ; esi = ->a_and_b
mov edi, [ebp+12]    ; edi = ->c_and_d
mov eax, [ebp+16]    ; eax = ->aplusb_cplusd
;==============================================================================
;ADD a AND b TOGETHER AND ALSO c AND d
;==============================================================================
emms
movq mm0, [esi]      ; mm0 = [b,a]
movq mm1, [edi]      ; mm1 = [d,c]
pfacc mm0, mm1       ; mm0 = [c+d,b+a]
;==============================================================================
;INSTRUCTIONS BELOW RESTORE THE REGISTER STATE WITH WHICH THIS ROUTINE
;WAS ENTERED
;REGISTERS (EAX, ECX, EDX ARE CONSIDERED VOLATILE AND ASSUMED TO BE CHANGED)
;WHILE THE REGISTERS BELOW MUST BE PRESERVED IF THE USER IS CHANGING THEM
pop edi
pop esi
pop ebx
mov esp,ebp
pop ebp
;==============================================================================
ret
_accumulate_3dnow ENDP
_TEXT ENDS
END
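The arithmetic of the 3DNow! routine above can be checked against a plain C model. The sketch below is an illustration, not code from this guide: the names `pfacc_model` and `accumulate_ref` are hypothetical, and the final store through the result pointer is an assumption about the routine's intent (the listing loads the address of aplusb_cplusd into eax). PFACC is modeled as writing the sum of the two destination floats to the low half of the result and the sum of the two source floats to the high half.

```c
/* Hypothetical scalar model of PFACC on a pair of 2-float MMX registers.
   Element 0 is the low 32 bits of the register, element 1 the high 32 bits. */
void pfacc_model(float dst[2], const float src[2])
{
    float lo = dst[0] + dst[1];   /* low result  = sum of the destination pair */
    float hi = src[0] + src[1];   /* high result = sum of the source pair      */
    dst[0] = lo;
    dst[1] = hi;
}

/* Hypothetical C counterpart of the assembly routine: computes a+b and c+d. */
void accumulate_ref(const float a_and_b[2],
                    const float c_and_d[2],
                    float aplusb_cplusd[2])
{
    float mm0[2] = { a_and_b[0], a_and_b[1] };   /* movq mm0, [esi] : mm0 = [b,a] */
    float mm1[2] = { c_and_d[0], c_and_d[1] };   /* movq mm1, [edi] : mm1 = [d,c] */

    pfacc_model(mm0, mm1);                       /* pfacc mm0, mm1  : [c+d,b+a]   */

    /* Assumption: the packed result is meant to be written to the
       address held in eax (aplusb_cplusd). */
    aplusb_cplusd[0] = mm0[0];                   /* a + b */
    aplusb_cplusd[1] = mm0[1];                   /* c + d */
}
```

Calling `accumulate_ref` with the pairs {1.0f, 2.0f} and {3.0f, 4.0f} leaves 3.0f and 7.0f in the output array, matching the [c+d,b+a] layout noted in the assembly comments.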

The same operation can be performed using SSE instructions, but the data in the XMM registers must be rearranged first. The next example loads four floating-point values into each of four XMM registers, XMM4–XMM7, and then rearranges and adds the values so that the sum of the four floats in each register is accumulated into a single float in XMM1.

;----------------------------------------------------------------------

; The instructions below take the 4 floats in each XMM register below:

Chapter 9

Optimizing with SIMD Instructions
