25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

9.12 Use XOR Operations to Negate Operands of SSE, SSE2, and 3DNow!™ Instructions

Optimization

For AMD Athlon, AMD Athlon 64, and AMD Opteron processors, use instructions that perform XOR operations (PXOR, XORPS, and XORPD) instead of multiplication instructions to change the sign bit of operands of SSE, SSE2, and 3DNow! instructions.

Application

This optimization applies to:

32-bit software

64-bit software

Rationale

On the AMD Athlon 64 and AMD Opteron processors, using XOR-type instructions allows for more parallelism, as these instructions can execute in either the FADD or FMUL pipe of the floating-point unit.

Single Precision

For single precision, you can use either 3DNow! or SSE SIMD XOR operations. The latency of multiplying by -1.0 in 3DNow! is 4 cycles, while the latency of using the PXOR instruction is only 2 cycles. Similarly, the latency of the MULPS instruction is 5 cycles, while the latency of the XORPS instruction is 3 cycles. The following code example illustrates how to toggle the sign bit of a number using 3DNow! instructions:

signmask DQ 8000000080000000h

pxor mm0, [signmask] ; Toggle sign bits of both floats.

This example does the same thing using SSE instructions:

signmask DQ 8000000080000000h,8000000080000000h

xorps xmm0, [signmask] ; Toggle sign bits of all four floats.
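The effect of the mask can be checked outside of assembly. The following is a minimal sketch in portable C, assuming `float` is an IEEE 754 single-precision value; the helper names `negate_f32_xor` and `negate_4xf32_xor` are hypothetical and only model, lane by lane, what XORPS does with the mask above. It is not a replacement for the SIMD instruction and says nothing about latency.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: flip the sign bit of one single-precision value.
   Assumes float is IEEE 754 binary32 (sign in bit 31). */
static float negate_f32_xor(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* copy out the raw bit pattern */
    bits ^= UINT32_C(0x80000000);     /* same mask as each dword of signmask */
    memcpy(&x, &bits, sizeof x);      /* copy the modified pattern back */
    return x;
}

/* Hypothetical helper: scalar model of XORPS with the four-lane mask. */
static void negate_4xf32_xor(float v[4])
{
    for (int i = 0; i < 4; ++i)
        v[i] = negate_f32_xor(v[i]);
}
```

Calling `negate_4xf32_xor` on an array of four floats flips the sign of every element, which matches the result of multiplying each element by -1.0 for ordinary finite values.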

Double Precision

To perform double-precision arithmetic, you can use the XORPD instruction, as in the single-precision example, to flip the sign of packed double-precision floating-point operands. The latency of the XORPD instruction is 3 cycles, whereas the latency of the MULPD instruction is 5 cycles.

signmask DQ 8000000000000000h,8000000000000000h

xorpd xmm0, [signmask] ; Toggle sign bit of both doubles.
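The double-precision mask works the same way, except that the sign occupies bit 63 of each 64-bit lane. As a rough sketch in portable C, assuming `double` is an IEEE 754 double-precision value (the helper names `negate_f64_xor` and `negate_2xf64_xor` are hypothetical and only model the per-lane effect of XORPD):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: flip the sign bit of one double-precision value.
   Assumes double is IEEE 754 binary64 (sign in bit 63). */
static double negate_f64_xor(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);          /* copy out the raw bit pattern */
    bits ^= UINT64_C(0x8000000000000000);    /* same mask as each qword of signmask */
    memcpy(&x, &bits, sizeof x);             /* copy the modified pattern back */
    return x;
}

/* Hypothetical helper: scalar model of XORPD with the two-lane mask. */
static void negate_2xf64_xor(double v[2])
{
    v[0] = negate_f64_xor(v[0]);
    v[1] = negate_f64_xor(v[1]);
}
```

Only the sign bit changes; the exponent and mantissa bits of each lane are left untouched.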

Chapter 9

Optimizing with SIMD Instructions
