25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Table 18. SSE Instructions (Continued)

| Syntax | Encoding: prefix byte | Encoding: first byte | Encoding: 2nd byte | Encoding: ModRM byte | Decode type | FPU pipe(s) | Latency | Note |
|---|---|---|---|---|---|---|---|---|
| DIVPS xmmreg, mem128 | 0Fh | 5Eh | | mm-xxx-xxx | Double | FMUL | 35 | |
| DIVSS xmmreg1, xmmreg2 | F3h | 0Fh | 5Eh | 11-xxx-xxx | DirectPath | FMUL | 16 | |
| DIVSS xmmreg, mem32 | F3h | 0Fh | 5Eh | mm-xxx-xxx | DirectPath | FMUL | 18 | |
| LDMXCSR mem32 | 0Fh | AEh | | mm-010-xxx | VectorPath | | 13 | 4 |
| MASKMOVQ mmreg1, mmreg2 | 0Fh | F7h | | 11-xxx-xxx | VectorPath | FADD/FMUL/FSTORE | 29 | |
| MAXPS xmmreg1, xmmreg2 | 0Fh | 5Fh | | 11-xxx-xxx | Double | FADD | 3 | 1 |
| MAXPS xmmreg, mem128 | 0Fh | 5Fh | | mm-xxx-xxx | Double | FADD | 5 | 1 |
| MAXSS xmmreg1, xmmreg2 | F3h | 0Fh | 5Fh | 11-xxx-xxx | DirectPath | FADD | 2 | |
| MAXSS xmmreg, mem32 | F3h | 0Fh | 5Fh | mm-xxx-xxx | DirectPath | FADD | 4 | |
| MINPS xmmreg1, xmmreg2 | 0Fh | 5Dh | | 11-xxx-xxx | Double | FADD | 3 | 1 |
| MINPS xmmreg, mem128 | 0Fh | 5Dh | | mm-xxx-xxx | Double | FADD | 5 | 1 |
| MINSS xmmreg1, xmmreg2 | F3h | 0Fh | 5Dh | 11-xxx-xxx | DirectPath | FADD | 2 | |
| MINSS xmmreg, mem32 | F3h | 0Fh | 5Dh | mm-xxx-xxx | DirectPath | FADD | 4 | |
| MOVAPS xmmreg1, xmmreg2 | 0Fh | 28h | | 11-xxx-xxx | Double | | 2 | |
| MOVAPS xmmreg, mem128 | 0Fh | 28h | | mm-xxx-xxx | Double | | 2 | |

Notes:

1. The low half of the result is available one cycle earlier than listed.

2. The second latency value indicates when the low half of the result becomes available.

3. The high half of the result is available one cycle earlier than listed.

4. The latency listed is the absolute minimum, while average latencies may be higher and are a function of internal pipeline conditions.

5. For the PREFETCHNTA/T0/T1/T2 instructions, the mem8 value refers to an address in the 64-byte line to be prefetched.

6. The 8-clock latency is only visible to younger stores that need to do an external write. The 2-clock latency is visible to the other stores and instructions.

7. This is the execution latency for the instruction. The time to complete the external write depends on the memory speed and the hardware implementation.
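
As an illustration of how the encoding columns of Table 18 combine into machine code, the following Python sketch assembles the bytes for the register-to-register forms (ModRM byte 11-xxx-xxx). The helper name `encode_sse_reg_reg` is hypothetical and not part of this guide. The sketch assumes the standard x86 ModRM layout (mod in bits 7-6, reg in bits 5-3, r/m in bits 2-0), with the first operand in the reg field and the second in the r/m field, and it handles only registers 0 through 7 (no REX prefix).

```python
def encode_sse_reg_reg(opcode_bytes, dest, src):
    """Return the machine-code bytes for a register-to-register SSE form.

    opcode_bytes: the non-empty encoding bytes that precede the ModRM byte,
                  in table order, e.g. (0xF3, 0x0F, 0x5E) for DIVSS.
    dest, src:    register numbers 0-7 (xmm0-xmm7); assumption: no REX prefix.
    """
    if not (0 <= dest <= 7 and 0 <= src <= 7):
        raise ValueError("only registers 0-7 are handled in this sketch")
    # ModRM = 11-ddd-sss: mod=11 selects register-direct addressing,
    # ddd is the first (destination) operand, sss is the second (source).
    modrm = (0b11 << 6) | (dest << 3) | src
    return bytes(opcode_bytes) + bytes([modrm])


# DIVSS xmm1, xmm2 (table row: F3h 0Fh 5Eh 11-xxx-xxx)
divss = encode_sse_reg_reg((0xF3, 0x0F, 0x5E), dest=1, src=2)
# MAXPS xmm0, xmm1 (table row: 0Fh 5Fh 11-xxx-xxx)
maxps = encode_sse_reg_reg((0x0F, 0x5F), dest=0, src=1)

print(divss.hex(" "))  # f3 0f 5e ca
print(maxps.hex(" "))  # 0f 5f c1
```

The memory forms (ModRM byte mm-xxx-xxx) are not covered here, because their encoding also depends on the addressing mode, and on any SIB and displacement bytes that follow.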

Appendix C

Instruction Latencies
