25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Table 18. SSE Instructions (Continued)

 

 

                            | Encoding
Syntax                      | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Note
----------------------------+-------------+------------+----------+------------+-------------+-------------+---------+-----
PREFETCHT0 mem8             | 0Fh         | 18h        |          | mm-001-xxx | DirectPath  | ~           | ~       | 5
PREFETCHT1 mem8             | 0Fh         | 18h        |          | mm-010-xxx | DirectPath  | ~           | ~       | 5
PREFETCHT2 mem8             | 0Fh         | 18h        |          | mm-011-xxx | DirectPath  | ~           | ~       | 5
PSADBW mmreg1, mmreg2       | 0Fh         | F6h        |          | 11-xxx-xxx | DirectPath  | FADD        | 3       |
PSADBW mmreg, mem64         | 0Fh         | F6h        |          | mm-xxx-xxx | DirectPath  | FADD        | 5       |
PSHUFW mmreg1, mmreg2, imm8 | 0Fh         | 70h        |          |            | DirectPath  | FADD/FMUL   | 2       |
PSHUFW mmreg, mem64, imm8   | 0Fh         | 70h        |          |            | DirectPath  | FADD/FMUL   | 4       |
RCPPS xmmreg1, xmmreg2      | 0Fh         | 53h        |          | 11-xxx-xxx | Double      | FMUL        | 4       | 1
RCPPS xmmreg, mem128        | 0Fh         | 53h        |          | mm-xxx-xxx | Double      | FMUL        | 6       | 1
RCPSS xmmreg1, xmmreg2      | F3h         | 0Fh        | 53h      | 11-xxx-xxx | DirectPath  | FMUL        | 3       |
RCPSS xmmreg, mem32         | F3h         | 0Fh        | 53h      | mm-xxx-xxx | DirectPath  | FMUL        | 5       |
RSQRTPS xmmreg1, xmmreg2    | 0Fh         | 52h        |          | 11-xxx-xxx | Double      | FMUL        | 4       | 1
RSQRTPS xmmreg, mem128      | 0Fh         | 52h        |          | mm-xxx-xxx | Double      | FMUL        | 6       | 1
RSQRTSS xmmreg1, xmmreg2    | F3h         | 0Fh        | 52h      | 11-xxx-xxx | DirectPath  | FMUL        | 3       |
RSQRTSS xmmreg, mem32       | F3h         | 0Fh        | 52h      | mm-xxx-xxx | DirectPath  | FMUL        | 5       |
SFENCE                      | 0Fh         | AEh        |          | 11-111-000 | VectorPath  |             | 2/8     | 6

Notes:

1. The low half of the result is available one cycle earlier than listed.

2. The second latency value indicates when the low half of the result becomes available.

3. The high half of the result is available one cycle earlier than listed.

4. The latency listed is the absolute minimum, while average latencies may be higher and are a function of internal pipeline conditions.

5. For the PREFETCHNTA/T0/T1/T2 instructions, the mem8 value refers to an address in the 64-byte line to be prefetched.

6. The 8-clock latency is only visible to younger stores that need to do an external write. The 2-clock latency is visible to the other stores and instructions.

7. This is the execution latency for the instruction. The time to complete the external write depends on the memory speed and the hardware implementation.
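
Note 5 and the ModRM encodings in the PREFETCHT0/T1/T2 rows can be illustrated with a short C sketch. It assumes only what the table states: a prefetch operates on the 64-byte line containing the mem8 address, and the reg field (bits 5:3) of the ModRM byte following 0Fh 18h selects the hint. The helper names `line_base`, `lines_spanned`, and `prefetch_hint_from_modrm` are hypothetical and introduced here for illustration; they are not part of any AMD-supplied interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Line size taken from note 5: a prefetch fetches the 64-byte line
   that contains the mem8 address. */
#define PREFETCH_LINE_BYTES 64u

/* Base address of the 64-byte line containing addr. Any address with
   the same base names the same line, so one prefetch covers them all. */
static uintptr_t line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(PREFETCH_LINE_BYTES - 1);
}

/* Number of distinct 64-byte lines touched by the byte range
   [addr, addr + len), i.e. how many prefetches would cover it. */
static size_t lines_spanned(uintptr_t addr, size_t len)
{
    if (len == 0)
        return 0;
    uintptr_t first = line_base(addr);
    uintptr_t last  = line_base(addr + len - 1);
    return (size_t)((last - first) / PREFETCH_LINE_BYTES) + 1;
}

/* Decode the ModRM byte that follows opcode bytes 0Fh 18h.
   Returns 0, 1, or 2 for PREFETCHT0, PREFETCHT1, PREFETCHT2 as listed
   in the table (mm-001-xxx, mm-010-xxx, mm-011-xxx), or -1 for any
   encoding not covered by those three rows. */
static int prefetch_hint_from_modrm(uint8_t modrm)
{
    unsigned mod = modrm >> 6;          /* "mm" in the table  */
    unsigned reg = (modrm >> 3) & 7u;   /* middle 3-bit field */

    if (mod == 3)
        return -1;  /* the table lists memory forms only */

    switch (reg) {
    case 1:  return 0;   /* PREFETCHT0 */
    case 2:  return 1;   /* PREFETCHT1 */
    case 3:  return 2;   /* PREFETCHT2 */
    default: return -1;  /* not one of the rows shown here */
    }
}
```

Because the unit of work is a whole line, a loop that issues software prefetches typically needs only one per 64 bytes of data; `lines_spanned` gives that count for an arbitrary, possibly unaligned, buffer.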

Appendix C

Instruction Latencies
