226 Optimizing with SIMD Instructions Chapter 9
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
pswapd mm4, QWORD PTR [esi+ecx*8] ; mm4=[x0r,x0i]
pswapd mm5, QWORD PTR [esi+ecx*8+8] ; mm5=[x1r,x1i]
pswapd mm6, QWORD PTR [esi+ecx*8+16] ; mm6=[x2r,x2i]
pswapd mm7, QWORD PTR [esi+ecx*8+24] ; mm7=[x3r,x3i]
pfmul mm0, QWORD PTR [edi+ecx*8] ; mm0=[x0i*y0i,x0r*y0r]
pfmul mm1, QWORD PTR [edi+ecx*8+8] ; mm1=[x1i*y1i,x1r*y1r]
pfmul mm2, QWORD PTR [edi+ecx*8+16] ; mm2=[x2i*y2i,x2r*y2r]
pfmul mm3, QWORD PTR [edi+ecx*8+24] ; mm3=[x3i*y3i,x3r*y3r]
pfmul mm4, QWORD PTR [edi+ecx*8] ; mm4=[x0r*y0i,x0i*y0r]
pfmul mm5, QWORD PTR [edi+ecx*8+8] ; mm5=[x1r*y1i,x1i*y1r]
pfmul mm6, QWORD PTR [edi+ecx*8+16] ; mm6=[x2r*y2i,x2i*y2r]
pfmul mm7, QWORD PTR [edi+ecx*8+24] ; mm7=[x3r*y3i,x3i*y3r]
pfpnacc mm0, mm4 ; mm0=[x0r*y0i+x0i*y0r,x0r*y0r-x0i*y0i]
pfpnacc mm1, mm5 ; mm1=[x1r*y1i+x1i*y1r,x1r*y1r-x1i*y1i]
pfpnacc mm2, mm6 ; mm2=[x2r*y2i+x2i*y2r,x2r*y2r-x2i*y2i]
pfpnacc mm3, mm7 ; mm3=[x3r*y3i+x3i*y3r,x3r*y3r-x3i*y3i]
movntq [eax+ecx*8], mm0 ; Stream MM0-MM3 to their respective memory
movntq [eax+ecx*8+8], mm1 ; addresses in prod[].
movntq [eax+ecx*8+16], mm2
movntq [eax+ecx*8+24], mm3
add ecx, 4 ; Advance the index by four complex numbers.
jnz four_cmplx_prod_loop ; Repeat until ECX reaches zero.
sfence ; Finish all memory writes.
;==============================================================================
; INSTRUCTIONS BELOW RESTORE THE REGISTER STATE WITH WHICH THIS ROUTINE WAS
; ENTERED
; REGISTERS EAX, ECX, EDX ARE CONSIDERED VOLATILE AND ASSUMED TO BE CHANGED
; WHILE THE REGISTERS BELOW MUST BE PRESERVED IF THE USER IS CHANGING THEM
femms ; Clear the MMX state.
pop edi
pop esi
pop ebx
mov esp, ebp
pop ebp
;===============================================================================
ret
_cmplx_multiply_3dnow ENDP
_TEXT ENDS
END
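The data flow of one complex product in the 3DNow! loop can be checked against a scalar model. The C sketch below is not part of the routine. It is a minimal model, assuming each MMX register holds two single-precision floats, with the real part in the low doubleword and the imaginary part in the high doubleword. The type mm_t and the helper names pswapd, pfmul, pfpnacc, and cmplx_mul_model are hypothetical and exist only for this illustration.

```c
/* Model of a 64-bit MMX register holding two packed floats.
   lo = bits 31..0, hi = bits 63..32. The listing's comments
   write register contents as [hi,lo]. */
typedef struct { float lo, hi; } mm_t;

/* PSWAPD: exchange the two doublewords of the source. */
static mm_t pswapd(mm_t s) { mm_t d = { s.hi, s.lo }; return d; }

/* PFMUL: multiply the two packed floats element by element. */
static mm_t pfmul(mm_t a, mm_t b) { mm_t d = { a.lo * b.lo, a.hi * b.hi }; return d; }

/* PFPNACC: low result = dest.lo - dest.hi (negative accumulate),
   high result = src.lo + src.hi (positive accumulate). */
static mm_t pfpnacc(mm_t d, mm_t s) { mm_t r = { d.lo - d.hi, s.lo + s.hi }; return r; }

/* One complex product, following the same steps as the loop body. */
static mm_t cmplx_mul_model(mm_t x, mm_t y)
{
    mm_t a = pfmul(x, y);          /* a = [xi*yi, xr*yr]             */
    mm_t b = pfmul(pswapd(x), y);  /* b = [xr*yi, xi*yr]             */
    return pfpnacc(a, b);          /* [xr*yi+xi*yr, xr*yr-xi*yi]     */
}
```

For example, passing x = {1.0f, 2.0f} (1+2i) and y = {3.0f, 4.0f} (3+4i) yields lo = -5 and hi = 10, which matches (1+2i)(3+4i) = -5+10i. The assembly loop performs this sequence on four products per iteration so that independent operations can overlap.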
The assembly listings above make use of several optimization techniques. First, the 3DNow! technology code uses the PSWAPD and PFPNACC instructions, whose operations are outlined below:
; PSWAPD
; Suppose that MM0 contains two floats: r and i.
; INPUT:
; MM0 = [i,r]
; OUTPUT:
; MM1 = [r,i]
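At the bit level, the PSWAPD operation outlined above is pure data movement: the two 32-bit halves of the 64-bit source trade places, and no floating-point arithmetic takes place. The C sketch below illustrates this under the assumption that the register is modeled as a uint64_t. The names pswapd_model and pswapd_floats are hypothetical helpers, not part of any AMD interface.

```c
#include <stdint.h>
#include <string.h>

/* Model of PSWAPD on a 64-bit value: swap the upper and lower doublewords. */
static uint64_t pswapd_model(uint64_t src)
{
    return (src >> 32) | (src << 32);
}

/* Apply the model to a pair of floats stored the way a complex number
   sits in memory: in[0] = r, in[1] = i. The bit patterns are copied
   unchanged, so out[0] = i and out[1] = r. */
static void pswapd_floats(const float in[2], float out[2])
{
    uint64_t mm0, mm1;
    memcpy(&mm0, in, sizeof mm0);   /* load 64 bits, as MOVQ would     */
    mm1 = pswapd_model(mm0);        /* MM1 = MM0 with its halves swapped */
    memcpy(out, &mm1, sizeof mm1);  /* store the swapped pair          */
}
```

Calling pswapd_floats with in = {1.5f, -2.0f} leaves out[0] = -2 and out[1] = 1.5.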