Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

Additionally, the examples that use SSE and 3DNow! instructions multiply four complex numbers concurrently to break up register dependencies. Loads, multiplications, and additions do not execute with zero delay; each has an associated latency. The following instructions:

movq    mm0, QWORD PTR [esi+ecx*8]  ; mm0 = [x0i,x0r]
pswapd  mm4, QWORD PTR [esi+ecx*8]  ; mm4 = [x0r,x0i]
pfmul   mm0, QWORD PTR [edi+ecx*8]  ; mm0 = [x0i*y0i,x0r*y0r]
pfmul   mm4, QWORD PTR [edi+ecx*8]  ; mm4 = [x0r*y0i,x0i*y0r]
pfpnacc mm0, mm4                    ; mm0 = [x0r*y0i+x0i*y0r,x0r*y0r-x0i*y0i]

are dependent upon one another. The move from memory (MOVQ) requires 2 cycles, PSWAPD also requires 2 cycles, each of the two PFMUL instructions requires 6 cycles, and PFPNACC requires 6 cycles. The instruction flow through the processor is illustrated on a clock-cycle basis, as follows:

Instruction 0     2     4     6     8     10    12    14
MOVQ        xxxxxx
PSWAPD      xxxxxx
PFMUL             xxxxxxxxxxxxxxxxxx
PFMUL                   xxxxxxxxxxxxxxxxxx
PFPNACC                                   xxxxxxxxxxxxxxxxxxx
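The five-instruction sequence can be checked on paper with a small C model. The following is only a sketch, not code from this guide: the type v2f and the functions pswapd, pfmul, pfpnacc, complex_mul, and chain_cycles are hypothetical names chosen for illustration. The lane arithmetic follows the register comments in the listing. The timing function uses the latencies quoted in the text and additionally assumes that the second PFMUL starts one cycle after the first; that stagger is an assumption made here so that the chain length agrees with the stated total.

```c
#include <assert.h>

/* Two packed single-precision lanes. The listing writes a register as
   [hi, lo], so a complex number is held as lo = real, hi = imaginary.
   The type name v2f is an assumption for this sketch. */
typedef struct { float lo, hi; } v2f;

/* PSWAPD: exchange the two 32-bit halves of the source. */
v2f pswapd(v2f s) { v2f r = { s.hi, s.lo }; return r; }

/* PFMUL: lane-wise multiply. */
v2f pfmul(v2f d, v2f s) { v2f r = { d.lo * s.lo, d.hi * s.hi }; return r; }

/* PFPNACC: low lane = d.lo - d.hi, high lane = s.lo + s.hi. */
v2f pfpnacc(v2f d, v2f s) { v2f r = { d.lo - d.hi, s.lo + s.hi }; return r; }

/* One complex product x*y, following the listing step by step. */
v2f complex_mul(v2f x, v2f y)
{
    v2f mm0 = x;               /* movq:    mm0 = [xi, xr]                   */
    v2f mm4 = pswapd(x);       /* pswapd:  mm4 = [xr, xi]                   */
    mm0 = pfmul(mm0, y);       /* pfmul:   mm0 = [xi*yi, xr*yr]             */
    mm4 = pfmul(mm4, y);       /* pfmul:   mm4 = [xr*yi, xi*yr]             */
    return pfpnacc(mm0, mm4);  /* pfpnacc: [xr*yi+xi*yr, xr*yr-xi*yi]       */
}

/* Cycle at which the dependent chain finishes, using the latencies
   quoted in the text (2, 2, 6, 6). */
int chain_cycles(void)
{
    const int lat_load = 2, lat_swap = 2, lat_mul = 6, lat_acc = 6;

    int movq_done  = lat_load;       /* MOVQ result ready   */
    int swap_done  = lat_swap;       /* PSWAPD result ready */
    int mul0_start = movq_done;      /* first PFMUL waits on MOVQ */

    /* ASSUMPTION: the second PFMUL cannot start in the same cycle as
       the first, so it begins at least one cycle later. */
    int mul1_start = swap_done > mul0_start + 1 ? swap_done : mul0_start + 1;

    int mul0_done = mul0_start + lat_mul;
    int mul1_done = mul1_start + lat_mul;

    /* PFPNACC needs both products. */
    int acc_start = mul0_done > mul1_done ? mul0_done : mul1_done;
    return acc_start + lat_acc;
}
```

For x = 1 + 2i and y = 3 + 4i, complex_mul returns [10, -5], that is, -5 + 10i, using four multiplications and two additions or subtractions. Under the stated assumption, chain_cycles returns 15.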

This sequence takes 15 cycles to finish. During these 15 cycles, the processor has the ability to perform 60 single-precision floating-point operations, of which it performs only six. Most of the time is spent waiting for previous instructions to complete so that their results are available as arguments to later instructions. By unrolling the multiplication, working with four complex numbers per clock, there are enough


Optimizing with SIMD Instructions

Chapter 9
