25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

C.7 SSE Instructions

Table 18. SSE Instructions

                              Encoding
Syntax                        Prefix  First  2nd   ModRM       Decode      FPU      Latency  Note
                              byte    byte   byte  byte        type        pipe(s)
----------------------------  ------  -----  ----  ----------  ----------  -------  -------  ----
ADDPS xmmreg1, xmmreg2        0Fh     58h    -     11-xxx-xxx  Double      FADD     5        1
ADDPS xmmreg, mem128          0Fh     58h    -     mm-xxx-xxx  Double      FADD     7        1
ADDSS xmmreg1, xmmreg2        F3h     0Fh    58h   11-xxx-xxx  DirectPath  FADD     4        -
ADDSS xmmreg, mem128          F3h     0Fh    58h   mm-xxx-xxx  DirectPath  FADD     6        -
ANDNPS xmmreg1, xmmreg2       0Fh     55h    -     11-xxx-xxx  Double      FMUL     3        1
ANDNPS xmmreg, mem128         0Fh     55h    -     mm-xxx-xxx  Double      FMUL     5        1
ANDPS xmmreg1, xmmreg2        0Fh     54h    -     11-xxx-xxx  Double      FMUL     3        1
ANDPS xmmreg, mem128          0Fh     54h    -     mm-xxx-xxx  Double      FMUL     5        1
CMPPS xmmreg1, xmmreg2, imm8  0Fh     C2h    -     11-xxx-xxx  Double      FADD     3        1
CMPPS xmmreg, mem128, imm8    0Fh     C2h    -     mm-xxx-xxx  Double      FADD     5        1
CMPSS xmmreg1, xmmreg2, imm8  F3h     0Fh    C2h   11-xxx-xxx  DirectPath  FADD     2        -
CMPSS xmmreg, mem32, imm8     F3h     0Fh    C2h   mm-xxx-xxx  DirectPath  FADD     4        -
COMISS xmmreg1, xmmreg2       0Fh     2Fh    -     11-xxx-xxx  VectorPath  -        4        -
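
As an illustration of how the Encoding columns of Table 18 fit together, the following minimal sketch assembles the byte sequence for a few register-to-register rows. It is not part of the guide: the names OPCODES, modrm_reg_reg, and encode_reg_reg are hypothetical, and the assignment of the two xxx fields (destination register in bits 5-3, source register in bits 2-0) is assumed from the general x86 ModRM convention rather than taken from the table. Only xmm0 through xmm7 are handled, and forms that take an imm8 (CMPPS, CMPSS) are left out.

```python
# Opcode bytes for some register-register rows of Table 18, copied from the
# Encoding columns in the order they appear (prefix byte first).
OPCODES = {
    "ADDPS": bytes([0x0F, 0x58]),
    "ADDSS": bytes([0xF3, 0x0F, 0x58]),
    "ANDNPS": bytes([0x0F, 0x55]),
    "ANDPS": bytes([0x0F, 0x54]),
    "COMISS": bytes([0x0F, 0x2F]),
}


def modrm_reg_reg(dest: int, src: int) -> int:
    """Build an 11-xxx-xxx ModRM byte.

    Assumption (standard x86 convention, not stated in the table):
    bits 7-6 = 11b, bits 5-3 = destination register, bits 2-0 = source.
    """
    if not (0 <= dest <= 7 and 0 <= src <= 7):
        raise ValueError("this sketch only covers xmm0..xmm7")
    return 0b11_000_000 | (dest << 3) | src


def encode_reg_reg(mnemonic: str, dest: int, src: int) -> bytes:
    """Concatenate the opcode bytes and the ModRM byte for one instruction."""
    return OPCODES[mnemonic] + bytes([modrm_reg_reg(dest, src)])
```

Under these assumptions, encode_reg_reg("ADDPS", 1, 2).hex() gives "0f58ca" and encode_reg_reg("ADDSS", 0, 3).hex() gives "f30f58c3".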
Notes:

1. The low half of the result is available one cycle earlier than listed.
2. The second latency value indicates when the low half of the result becomes available.
3. The high half of the result is available one cycle earlier than listed.
4. The latency listed is the absolute minimum, while average latencies may be higher and are a function of internal pipeline conditions.
5. For the PREFETCHNTA/T0/T1/T2 instructions, the mem8 value refers to an address in the 64-byte line to be prefetched.
6. The 8-clock latency is only visible to younger stores that need to do an external write. The 2-clock latency is visible to the other stores and instructions.
7. This is the execution latency for the instruction. The time to complete the external write depends on the memory speed and the hardware implementation.
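
To show how note 1 modifies the Latency column, here is a small sketch that looks up a row and reports when the low half and the full result become available. The LATENCY dictionary and the result_ready function are hypothetical names introduced for this illustration; the numbers are the register-form and memory-form latencies listed in Table 18 for the rows shown on this page, and the flag records whether the row carries note 1.

```python
# mnemonic -> (register-form latency, memory-form latency, row carries note 1)
# Values copied from Table 18; the structure itself is illustrative only.
LATENCY = {
    "ADDPS": (5, 7, True),
    "ADDSS": (4, 6, False),
    "ANDNPS": (3, 5, True),
    "ANDPS": (3, 5, True),
    "CMPPS": (3, 5, True),
    "CMPSS": (2, 4, False),
}


def result_ready(mnemonic: str, memory_operand: bool = False) -> tuple:
    """Return (low_half_cycles, full_result_cycles) for one table row.

    For rows marked with note 1 the low half is available one cycle
    earlier than the listed latency; otherwise both values are equal.
    """
    reg, mem, note1 = LATENCY[mnemonic]
    listed = mem if memory_operand else reg
    low_half = listed - 1 if note1 else listed
    return low_half, listed
```

For example, result_ready("ADDPS") returns (4, 5), while result_ready("CMPSS") returns (2, 2) because that row has no note. In every register/memory pair listed on this page, the memory form is two cycles longer than the register form.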
Appendix C Instruction Latencies
