Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

Table 18. SSE Instructions (Continued)

 

 

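| Syntax                   | Encoding: prefix byte | Encoding: first byte | Encoding: 2nd byte | Encoding: ModRM byte | Decode type | FPU pipe(s) | Latency | Note |
|--------------------------|-----------------------|----------------------|--------------------|----------------------|-------------|-------------|---------|------|
| MOVAPS xmmreg1, xmmreg2  | 0Fh                   | 29h                  |                    | 11-xxx-xxx           | Double      |             | 2       |      |
| MOVAPS mem128, xmmreg    | 0Fh                   | 29h                  |                    | mm-xxx-xxx           | Double      |             | 3       | 1    |
| MOVHLPS xmmreg1, xmmreg2 | 0Fh                   | 12h                  |                    | 11-xxx-xxx           | DirectPath  |             | 2       |      |
| MOVHPS xmmreg, mem64     | 0Fh                   | 16h                  |                    | mm-xxx-xxx           | DirectPath  |             | 2       |      |
| MOVHPS mem64, xmmreg     | 0Fh                   | 17h                  |                    | mm-xxx-xxx           | DirectPath  |             | 2       |      |
| MOVLHPS xmmreg1, xmmreg2 | 0Fh                   | 16h                  |                    | 11-xxx-xxx           | DirectPath  |             | 2       |      |
| MOVLPS xmmreg, mem64     | 0Fh                   | 12h                  |                    | mm-xxx-xxx           | DirectPath  |             | 2       |      |
| MOVLPS mem64, xmmreg     | 0Fh                   | 13h                  |                    | mm-xxx-xxx           | DirectPath  |             | 2       |      |
| MOVMSKPS reg32, xmmreg   | 0Fh                   | 50h                  |                    | 11-xxx-xxx           | VectorPath  |             | 3       |      |
| MOVNTPS mem128, xmmreg   | 0Fh                   | 2Bh                  |                    | mm-xxx-xxx           | Double      |             | 3       | 7    |
| MOVNTQ mem64, mmreg      | 0Fh                   | E7h                  |                    | mm-xxx-xxx           | DirectPath  | FSTORE      | 2       | 7    |
| MOVSS xmmreg1, xmmreg2   | F3h                   | 0Fh                  | 10h                | 11-xxx-xxx           | DirectPath  |             | 2       |      |
| MOVSS xmmreg, mem32      | F3h                   | 0Fh                  | 10h                | mm-xxx-xxx           | Double      |             | 3       |      |
| MOVSS xmmreg1, xmmreg2   | F3h                   | 0Fh                  | 11h                | 11-xxx-xxx           | DirectPath  |             | 2       |      |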
Notes:

1. The low half of the result is available one cycle earlier than listed.

2. The second latency value indicates when the low half of the result becomes available.

3. The high half of the result is available one cycle earlier than listed.

4. The latency listed is the absolute minimum, while average latencies may be higher and are a function of internal pipeline conditions.

5. For the PREFETCHNTA/T0/T1/T2 instructions, the mem8 value refers to an address in the 64-byte line to be prefetched.

6. The 8-clock latency is only visible to younger stores that need to do an external write. The 2-clock latency is visible to the other stores and instructions.

7. This is the execution latency for the instruction. The time to complete the external write depends on the memory speed and the hardware implementation.


Instruction Latencies

Appendix C
