Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

Additionally, the examples that use SSE and 3DNow! instructions multiply four complex numbers concurrently to break up register dependencies. Loads, multiplications, and additions do not complete with zero delay; each has an associated latency. The following instructions:

movq    mm0, QWORD PTR [esi+ecx*8]  ; mm0 = [x0i,x0r]
pswapd  mm4, QWORD PTR [esi+ecx*8]  ; mm4 = [x0r,x0i]
pfmul   mm0, QWORD PTR [edi+ecx*8]  ; mm0 = [x0iy0i,x0ry0r]
pfmul   mm4, QWORD PTR [edi+ecx*8]  ; mm4 = [x0ry0i,x0iy0r]
pfpnacc mm0, mm4                    ; mm0 = [x0ry0i+x0iy0r,x0ry0r-x0iy0i]

are dependent upon one another. The move from memory (MOVQ) requires 2 cycles, PSWAPD also requires 2 cycles, the two PFMUL instructions require 6 cycles, and PFPNACC requires 6 cycles. The instruction flow through the processor is illustrated on a clock-cycle basis, as follows:

Instruction	0	2	4	6	8	10	12	14
MOVQ	xxxxxx
PSWAPD	xxxxxx
PFMUL		xxxxxxxxxxxxxxxxxx
PFMUL			xxxxxxxxxxxxxxxxxx
PFPNACC						xxxxxxxxxxxxxxxxxxx

and takes 15 cycles to finish. During these 15 cycles, the processor can perform 60 single-precision floating-point operations, of which it performs only six. Most of the time is spent waiting for previous instructions to complete so that the operands of later instructions are available. By unrolling the multiplication, working with four complex numbers per clock, there are enough
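The lane-level behavior of the five-instruction sequence, and the utilization figure quoted above, can be checked with a small model. The following Python sketch is illustrative only and is not part of the guide: the helper names (`pswapd`, `pfmul`, `pfpnacc`, `complex_multiply`) are hypothetical stand-ins for the instructions, each MMX register is modeled as a (low, high) tuple, and Python floats are double precision rather than single precision. The peak figure of 60 operations is taken directly from the text.

```python
# Each 64-bit MMX register is modeled as a (low, high) pair of floats.
# The listing's comments show the high element first: [high, low].

def pswapd(src):
    """Model of PSWAPD: exchange the two 32-bit halves of the source."""
    low, high = src
    return (high, low)

def pfmul(dst, src):
    """Model of PFMUL: element-wise multiply of two packed pairs (2 operations)."""
    return (dst[0] * src[0], dst[1] * src[1])

def pfpnacc(dst, src):
    """Model of PFPNACC: low = dst.low - dst.high, high = src.low + src.high."""
    return (dst[0] - dst[1], src[0] + src[1])

def complex_multiply(x, y):
    """Follow the five-instruction sequence; x and y are (real, imag) pairs."""
    mm0 = x                    # movq    mm0, [x]  -> low = x0r, high = x0i
    mm4 = pswapd(x)            # pswapd  mm4, [x]  -> low = x0i, high = x0r
    mm0 = pfmul(mm0, y)        # low = x0r*y0r, high = x0i*y0i
    mm4 = pfmul(mm4, y)        # low = x0i*y0r, high = x0r*y0i
    return pfpnacc(mm0, mm4)   # low = real part, high = imaginary part

# Utilization arithmetic from the text: 60 operations are possible in
# 15 cycles, but only the two PFMULs and the PFPNACC do floating-point work.
CYCLES = 15
PEAK_FLOPS = 60                # figure quoted in the text
FLOPS_USED = 2 + 2 + 2         # two lanes for each of the three instructions
utilization = FLOPS_USED / PEAK_FLOPS

if __name__ == "__main__":
    print(complex_multiply((1.0, 2.0), (3.0, 4.0)))
    print(f"{FLOPS_USED} of {PEAK_FLOPS} operations in {CYCLES} cycles "
          f"({utilization:.0%} of peak)")
```

In this model, every step after the loads consumes the result of an earlier step, which is why the sequence cannot overlap its own latencies and uses only a tenth of the available floating-point throughput.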

Optimizing with SIMD Instructions

Chapter 9
