25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

instructions that are not dependent on previous or presently executing operations so that the processor can mask the execution latency by keeping itself busy, as illustrated below:

[Pipeline timing diagram: one row per instruction, plotted against clock cycles 0 through 18. Instructions in issue order: MOVQ (4), PSWAPD (4), PFMUL (8), PFPNACC (4). Each instruction's execution bar starts at the same cycle as, or later than, the bar of the instruction before it, so the bars of neighboring instructions overlap.]

Multiplying four complex single-precision numbers takes only 17 cycles, compared with 14 cycles to multiply a single complex single-precision number. The floating-point pipes are kept busy by feeding new instructions into the floating-point pipeline each cycle. In the arrangement above, 24 floating-point operations (four multiplies and two additions or subtractions per complex product) are performed in 17 cycles, achieving more than a 3.5x increase in performance.
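The dependency structure behind this scheduling can be sketched in portable C. This is a scalar illustration only, not the guide's 3DNow! code: the type cfloat and the functions cmul4 and cmul_flops are hypothetical names introduced here. The sketch groups the sixteen multiplies of four complex products so that none depends on another, and only then performs the eight additions and subtractions. Whether a compiler and processor actually overlap the operations this way depends on the toolchain and the hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex type for illustration; not taken from the guide. */
typedef struct { float re, im; } cfloat;

/* Multiply four independent pairs of complex numbers: out[k] = a[k] * b[k].
 * Each product needs 4 multiplies and 2 add/subtract operations. */
void cmul4(const cfloat a[4], const cfloat b[4], cfloat out[4])
{
    float rr[4], ii[4], ri[4], ir[4];
    size_t k;

    assert(a != NULL && b != NULL && out != NULL);

    /* Stage 1: sixteen multiplies. No multiply reads the result of
     * another, so they are free to overlap in a pipelined multiplier. */
    for (k = 0; k < 4; ++k) {
        rr[k] = a[k].re * b[k].re;
        ii[k] = a[k].im * b[k].im;
        ri[k] = a[k].re * b[k].im;
        ir[k] = a[k].im * b[k].re;
    }

    /* Stage 2: eight add/subtract operations, each depending only on
     * two stage-1 results for the same element. */
    for (k = 0; k < 4; ++k) {
        out[k].re = rr[k] - ii[k];
        out[k].im = ri[k] + ir[k];
    }
}

/* Floating-point operation count for n complex multiplications
 * (6 per product: 4 multiplies + 2 add/subtract). */
unsigned cmul_flops(unsigned n)
{
    return n * 6u;
}
```

Calling cmul_flops(4) gives the 24 operations quoted in the text; cmul4 produces the four products themselves.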

The last optimization in both implementations is the use of the MOVNTQ and MOVNTPS instructions, which are nontemporal writes that stream data to main memory. These instructions increase throughput to memory and make more efficient use of the bandwidth provided by the processor and memory controller. Nontemporal writes, such as MOVNTQ, MOVNTPS, and MOVNTDQ, should be used only on data that will not be accessed again in the near future.
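As a rough illustration of a nontemporal write from C, the sketch below uses the SSE intrinsic _mm_stream_ps, which compilers typically emit as MOVNTPS. The helper name stream_copy_f32 is hypothetical, and the sketch assumes 16-byte-aligned buffers and a length that is a multiple of four floats. It covers only the MOVNTPS case, not MOVNTQ or MOVNTDQ, and falls back to ordinary stores when SSE is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>  /* _mm_load_ps, _mm_stream_ps, _mm_sfence */
#define HAVE_SSE_STREAM 1
#else
#define HAVE_SSE_STREAM 0
#endif

/* Hypothetical helper: copy n floats from src to dst using nontemporal
 * stores where available.
 * Assumptions: n is a multiple of 4; dst and src are 16-byte aligned;
 * the caller does not intend to read dst again soon. */
void stream_copy_f32(float *dst, const float *src, size_t n)
{
    size_t i;

    assert(n % 4 == 0);
    assert((uintptr_t)dst % 16 == 0);
    assert((uintptr_t)src % 16 == 0);

#if HAVE_SSE_STREAM
    for (i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(src + i);  /* aligned 128-bit load */
        _mm_stream_ps(dst + i, v);        /* nontemporal 128-bit store */
    }
    /* Make the streamed stores globally visible before later stores. */
    _mm_sfence();
#else
    /* Portable fallback: ordinary stores, no streaming behavior. */
    for (i = 0; i < n; ++i) {
        dst[i] = src[i];
    }
#endif
}
```

A caller would typically reserve this path for large output buffers that are written once and consumed much later, in line with the guidance above about data that is not reused soon.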

Chapter 9

Optimizing with SIMD Instructions
