Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

Table 18. SSE Instructions (Continued)

                              Encoding
Syntax                        Prefix  First  2nd   ModRM       Decode      FPU      Latency  Note
                              byte    byte   byte  byte        type        pipe(s)
-------------------------------------------------------------------------------------------------
SHUFPS xmmreg1, xmmreg2, imm8 0Fh     C6h          11-xxx-xxx  VectorPath  FMUL     4        1
SHUFPS xmmreg, mem128, imm8   0Fh     C6h          mm-xxx-xxx  VectorPath  FMUL     6        2
SQRTPS xmmreg1, xmmreg2       0Fh     51h          11-xxx-xxx  Double      FMUL     39
SQRTPS xmmreg, mem128         0Fh     51h          mm-xxx-xxx  Double      FMUL     41
SQRTSS xmmreg1, xmmreg2       F3h     0Fh    51h   11-xxx-xxx  DirectPath  FMUL     19
SQRTSS xmmreg, mem32          F3h     0Fh    51h   mm-xxx-xxx  DirectPath  FMUL     21
STMXCSR mem32                 0Fh     AEh          mm-011-xxx  VectorPath           11       4
SUBPS xmmreg1, xmmreg2        0Fh     5Ch          11-xxx-xxx  Double      FADD     5        1
SUBPS xmmreg, mem128          0Fh     5Ch          mm-xxx-xxx  Double      FADD     7        1
SUBSS xmmreg1, xmmreg2        F3h     0Fh    5Ch   11-xxx-xxx  DirectPath  FADD     4
SUBSS xmmreg, mem32           F3h     0Fh    5Ch   mm-xxx-xxx  DirectPath  FADD     6
UCOMISS xmmreg1, xmmreg2      0Fh     2Eh          11-xxx-xxx  VectorPath           4
UCOMISS xmmreg, mem32         0Fh     2Eh          mm-xxx-xxx  VectorPath           6
UNPCKHPS xmmreg1, xmmreg2     0Fh     15h          11-xxx-xxx  Double      FMUL     3        1
UNPCKHPS xmmreg, mem128       0Fh     15h          mm-xxx-xxx  Double      FMUL     5        1

Notes:

1. The low half of the result is available one cycle earlier than listed.

2. The second latency value indicates when the low half of the result becomes available.

3. The high half of the result is available one cycle earlier than listed.

4. The latency listed is the absolute minimum, while average latencies may be higher and are a function of internal pipeline conditions.

5. For the PREFETCHNTA/T0/T1/T2 instructions, the mem8 value refers to an address in the 64-byte line to be prefetched.

6. The 8-clock latency is only visible to younger stores that need to do an external write. The 2-clock latency is visible to the other stores and instructions.

7. This is the execution latency for the instruction. The time to complete the external write depends on the memory speed and the hardware implementation.
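
As an illustration of how notes 1 and 3 modify the Latency column, the following Python sketch computes when each 64-bit half of a 128-bit result becomes available. The names Row and half_latencies are hypothetical and do not come from this guide. The sketch assumes that "one cycle earlier than listed" means the listed latency minus one. It handles only notes 1 and 3, and leaves rows with any other note, or with no note, unchanged.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Row:
    """One table row, reduced to the fields this sketch needs (hypothetical type)."""
    syntax: str
    latency: int          # value of the Latency column, in cycles
    note: Optional[int]   # value of the Note column, or None if the cell is empty


def half_latencies(row: Row) -> Tuple[int, int]:
    """Return (low_half, high_half) availability in cycles.

    Note 1: the low half is available one cycle earlier than listed.
    Note 3: the high half is available one cycle earlier than listed.
    Any other note value leaves both halves at the listed latency.
    """
    low = high = row.latency
    if row.note == 1:
        low -= 1
    elif row.note == 3:
        high -= 1
    return low, high


# A few rows copied from the table above.
ROWS = [
    Row("SUBPS xmmreg1, xmmreg2", 5, 1),
    Row("SUBPS xmmreg, mem128", 7, 1),
    Row("UNPCKHPS xmmreg1, xmmreg2", 3, 1),
    Row("SQRTPS xmmreg1, xmmreg2", 39, None),
]

if __name__ == "__main__":
    for row in ROWS:
        low, high = half_latencies(row)
        print(f"{row.syntax:<28} low half: {low:>2}  high half: {high:>2}")
```

Under these assumptions, a consumer that reads only the low half of a SUBPS register-to-register result could start after 4 cycles instead of the listed 5. Per note 4, some listed latencies are minimums, so figures derived this way are best treated as lower bounds.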

Instruction Latencies

Appendix C
