Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

pswapd  mm4, QWORD PTR [esi+ecx*8]      ; mm4=[x0r,x0i]
pswapd  mm5, QWORD PTR [esi+ecx*8+8]    ; mm5=[x1r,x1i]
pswapd  mm6, QWORD PTR [esi+ecx*8+16]   ; mm6=[x2r,x2i]
pswapd  mm7, QWORD PTR [esi+ecx*8+24]   ; mm7=[x3r,x3i]

pfmul   mm0, QWORD PTR [edi+ecx*8]      ; mm0=[x0i*y0i,x0r*y0r]
pfmul   mm1, QWORD PTR [edi+ecx*8+8]    ; mm1=[x1i*y1i,x1r*y1r]
pfmul   mm2, QWORD PTR [edi+ecx*8+16]   ; mm2=[x2i*y2i,x2r*y2r]
pfmul   mm3, QWORD PTR [edi+ecx*8+24]   ; mm3=[x3i*y3i,x3r*y3r]
pfmul   mm4, QWORD PTR [edi+ecx*8]      ; mm4=[x0r*y0i,x0i*y0r]
pfmul   mm5, QWORD PTR [edi+ecx*8+8]    ; mm5=[x1r*y1i,x1i*y1r]
pfmul   mm6, QWORD PTR [edi+ecx*8+16]   ; mm6=[x2r*y2i,x2i*y2r]
pfmul   mm7, QWORD PTR [edi+ecx*8+24]   ; mm7=[x3r*y3i,x3i*y3r]

pfpnacc mm0, mm4                        ; mm0=[x0r*y0i+x0i*y0r,x0r*y0r-x0i*y0i]
pfpnacc mm1, mm5                        ; mm1=[x1r*y1i+x1i*y1r,x1r*y1r-x1i*y1i]
pfpnacc mm2, mm6                        ; mm2=[x2r*y2i+x2i*y2r,x2r*y2r-x2i*y2i]
pfpnacc mm3, mm7                        ; mm3=[x3r*y3i+x3i*y3r,x3r*y3r-x3i*y3i]

movntq  [eax+ecx*8], mm0                ; Stream MM0-MM3 to representative memory
movntq  [eax+ecx*8+8], mm1              ; addresses of prod[]
movntq  [eax+ecx*8+16], mm2
movntq  [eax+ecx*8+24], mm3
add     ecx, 4                          ; ECX = ECX + 4
jnz     four_cmplx_prod_loop
sfence                                  ; Finish all memory writes.

;==============================================================================
; INSTRUCTIONS BELOW RESTORE THE REGISTER STATE WITH WHICH THIS ROUTINE WAS
; ENTERED
; REGISTERS EAX, ECX, EDX ARE CONSIDERED VOLATILE AND ASSUMED TO BE CHANGED
; WHILE THE REGISTERS BELOW MUST BE PRESERVED IF THE USER IS CHANGING THEM
femms
pop     edi
pop     esi
pop     ebx
mov     esp, ebp
pop     ebp
;==============================================================================
ret
_cmplx_multiply_3dnow ENDP
_TEXT ENDS
END
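As a cross-check on the register comments in the listing, the arithmetic applied to each complex pair can be sketched in scalar C. This is only a reference model, not part of the guide's code: the function name cmplx_multiply_ref is hypothetical, and the interleaved {real, imaginary} float layout is an assumption inferred from the 8-byte element stride and the [xNr,xNi] comments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar reference: prod[k] = x[k] * y[k] for n complex
   numbers, each stored as two interleaved floats {re, im} (assumed layout). */
static void cmplx_multiply_ref(const float *x, const float *y,
                               float *prod, size_t n)
{
    assert(x != NULL && y != NULL && prod != NULL);
    for (size_t k = 0; k < n; ++k) {
        float xr = x[2 * k], xi = x[2 * k + 1];
        float yr = y[2 * k], yi = y[2 * k + 1];
        /* Real part: matches the low half left by PFPNACC (xr*yr - xi*yi). */
        prod[2 * k] = xr * yr - xi * yi;
        /* Imaginary part: matches the high half (xr*yi + xi*yr). */
        prod[2 * k + 1] = xr * yi + xi * yr;
    }
}
```

The assembly loop computes four such products per iteration; a scalar routine like this one is mainly useful for validating the output of the unrolled version.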

The examples above make use of many optimization techniques. First, the 3DNow! technology code uses the PSWAPD and PFPNACC instructions, whose operations are outlined below:

;PSWAPD

;Suppose that MM0 contains two floats: r and i.

;INPUT:

;MM0 = [i,r]

;OUTPUT:

;MM1 = [r,i]
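The lane behavior of these instructions can also be modeled in plain C, which may help when checking the register comments in the loop. The sketch below is an illustration under stated assumptions: the mmx2f type and all function names are hypothetical, the [a,b] notation is taken to list the high element first, and PFPNACC is modeled as the loop comments describe it (low result = difference of the destination halves, high result = sum of the source halves).

```c
#include <assert.h>

/* Hypothetical model of a 64-bit MMX register holding two packed floats.
   In the guide's [a,b] notation the high element comes first, so
   [i,r] means hi = i and lo = r. */
typedef struct { float lo, hi; } mmx2f;

/* PSWAPD: swap the two 32-bit halves of the source. */
static mmx2f pswapd(mmx2f src)
{
    mmx2f d = { src.hi, src.lo };
    return d;
}

/* PFMUL: element-wise multiply of the two packed floats. */
static mmx2f pfmul(mmx2f a, mmx2f b)
{
    mmx2f d = { a.lo * b.lo, a.hi * b.hi };
    return d;
}

/* PFPNACC: low result  = dest.lo - dest.hi (negative accumulate),
            high result = src.lo + src.hi  (positive accumulate). */
static mmx2f pfpnacc(mmx2f dest, mmx2f src)
{
    mmx2f d = { dest.lo - dest.hi, src.lo + src.hi };
    return d;
}

/* One complex product, following the loop's instruction order. */
static mmx2f cmplx_mul_model(mmx2f x, mmx2f y)
{
    mmx2f a = pfmul(x, y);          /* [xi*yi, xr*yr] */
    mmx2f b = pfmul(pswapd(x), y);  /* [xr*yi, xi*yr] */
    return pfpnacc(a, b);           /* [xr*yi+xi*yr, xr*yr-xi*yi] */
}

/* Usage check: (1 + 2i) * (3 + 4i) = -5 + 10i. */
static void cmplx_mul_model_check(void)
{
    mmx2f x = { 1.0f, 2.0f }, y = { 3.0f, 4.0f };
    mmx2f p = cmplx_mul_model(x, y);
    assert(p.lo == -5.0f && p.hi == 10.0f);
}
```

Modeling each register as a {lo, hi} pair keeps the swap and the mixed subtract/add of PFPNACC explicit, which is the part of the sequence that is easiest to misread from the comments alone.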


Optimizing with SIMD Instructions

Chapter 9
