306 Instruction Latencies Appendix C
25112 Rev. 3.06 September 2005
Software Optimization Guide for AMD64 Processors
PSUBSB mmreg1, mmreg2 0Fh E8h 11-xxx-xxx DirectPath FADD/FMUL 2
PSUBSB mmreg, mem64 0Fh E8h mm-xxx-xxx DirectPath FADD/FMUL 4
PSUBSW mmreg1, mmreg2 0Fh E9h 11-xxx-xxx DirectPath FADD/FMUL 2
PSUBSW mmreg, mem64 0Fh E9h mm-xxx-xxx DirectPath FADD/FMUL 4
PSUBUSB mmreg1, mmreg2 0Fh D8h 11-xxx-xxx DirectPath FADD/FMUL 2
PSUBUSB mmreg, mem64 0Fh D8h mm-xxx-xxx DirectPath FADD/FMUL 4
PSUBUSW mmreg1, mmreg2 0Fh D9h 11-xxx-xxx DirectPath FADD/FMUL 2
PSUBUSW mmreg, mem64 0Fh D9h mm-xxx-xxx DirectPath FADD/FMUL 4
PSUBW mmreg1, mmreg2 0Fh F9h 11-xxx-xxx DirectPath FADD/FMUL 2
PSUBW mmreg, mem64 0Fh F9h mm-xxx-xxx DirectPath FADD/FMUL 4
PUNPCKHBW mmreg1,
mmreg2
0Fh 68h 11-xxx-xxx DirectPath FADD/FMUL 2
PUNPCKHBW mmreg, mem64 0Fh 68h mm-xxx-xxx DirectPath FADD/FMUL 4
PUNPCKHDQ mmreg1,
mmreg2
0Fh 6Ah 11-xxx-xxx DirectPath FADD/FMUL 2
PUNPCKHDQ mmreg, mem64 0Fh 6Ah mm-xxx-xxx DirectPath FADD/FMUL 4
PUNPCKHWD mmreg1,
mmreg2
0Fh 69h 11-xxx-xxx DirectPath FADD/FMUL 2
PUNPCKHWD mmreg, mem64 0Fh 69h mm-xxx-xxx DirectPath FADD/FMUL 4
PUNPCKLBW mmreg1,
mmreg2
0Fh 60h 11-xxx-xxx DirectPath FADD/FMUL 2
PUNPCKLBW mmreg, mem64 0Fh 60h mm-xxx-xxx DirectPath FADD/FMUL 4
PUNPCKLDQ mmreg1,
mmreg2
0Fh 62h 11-xxx-xxx DirectPath FADD/FMUL 2
PUNPCKLDQ mmreg, mem64 0Fh 62h mm-xxx-xxx DirectPath FADD/FMUL 4
PUNPCKLWD mmreg1,
mmreg2
0Fh 61h 11-xxx-xxx DirectPath FADD/FMUL 2
PUNPCKLWD mmreg, mem64 0Fh 61h mm-xxx-xxx DirectPath FADD/FMUL 4
PXOR mmreg1, mmreg2 0Fh EFh 11-xxx-xxx DirectPath FADD/FMUL 2
PXOR mmreg, mem64 0Fh EFh mm-xxx-xxx DirectPath FADD/FMUL 4
Table 14. MMX™ Technology Instructions (Continued)
Syntax
Encoding
Decode
type FPU pipe(s) Latency Note
Prefix
byte
First
byte ModRM byte
Notes:
1. Bits 2, 1, and 0 of the ModRM byte select the integer register.
2. These instructions have an effective latency as shown. However, these instructions generate an internal NOP
with a latency of two cycles but no related dependencies. These internal NOPs can be executed at a rate of
three per cycle and can use any of the three execution resources.