
Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

Table 15. x87 Floating-Point Instructions (Continued)

Syntax	Encoding, first byte	Encoding, second byte	Encoding, ModRM byte	Decode type	FPU pipe(s)	Latency	Note
FSTENV [mem28byte]	D9h		mm-110-xxx	VectorPath	-	89

FSTP [mem32real]	D9h		mm-011-xxx	DirectPath	FADD/FMUL	2

FSTP [mem64real]	DDh		mm-011-xxx	DirectPath	FADD/FMUL	2

FSTP [mem80real]	DBh		mm-111-xxx	VectorPath	-	8

FSTP ST(i)	DDh		11-011-xxx	DirectPath	FADD/FMUL	2

FSTSW AX	DFh		11-100-000	VectorPath	-	12

FSTSW [mem16]	DDh		mm-111-xxx	VectorPath	FSTORE	8	3

FSUB [mem32real]	D8h		mm-100-xxx	DirectPath	FADD	6

FSUB [mem64real]	DCh		mm-100-xxx	DirectPath	FADD	6

FSUB ST, ST(i)	D8h		11-100-xxx	DirectPath	FADD	4	1

FSUB ST(i), ST	DCh		11-101-xxx	DirectPath	FADD	4	1

FSUBP ST(i), ST	DEh		11-101-xxx	DirectPath	FADD	4	1

FSUBR [mem32real]	D8h		mm-101-xxx	DirectPath	FADD	6

FSUBR [mem64real]	DCh		mm-101-xxx	DirectPath	FADD	6

FSUBR ST, ST(i)	D8h		11-101-xxx	DirectPath	FADD	4	1

FSUBR ST(i), ST	DCh		11-100-xxx	DirectPath	FADD	4	1

FSUBRP ST(i), ST	DEh		11-100-xxx	DirectPath	FADD	4	1

FTST	D9h		11-100-100	DirectPath	FADD	2

FUCOM	DDh		11-100-xxx	DirectPath	FADD	2

FUCOMI ST, ST(i)	DBh		11-101-xxx	VectorPath	FADD	3	3

FUCOMIP ST, ST(i)	DFh		11-101-xxx	VectorPath	FADD	3	3

FUCOMP	DDh		11-101-xxx	DirectPath	FADD	2

FUCOMPP	DAh		11-101-001	DirectPath	FADD	2

FWAIT	9Bh			DirectPath	-	0

FXAM	D9h		11-100-101	VectorPath	-	2

Notes:

1. The last three bits of the ModRM byte select the stack entry ST(i).

2. These instructions have an effective latency as shown. However, these instructions generate an internal NOP with a latency of two cycles but no related dependencies. These internal NOPs can be executed at a rate of three per cycle and can use any of the three execution resources.

3. This is a VectorPath decoded operation that uses one execution pipe (one ROP).

4. There is additional latency associated with this instruction. “e” represents the difference between the exponents of the divisor and the dividend. If “s” is the number of normalization shifts performed on the result, then n = (s + 1)/2, where 0 <= n <= 32.

5. The latency provided for this operation is the best-case latency.

6. The three latency numbers represent the latency values for precision control settings of single precision, double precision, and extended precision, respectively.
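
As an illustration of Note 1 and of how the ModRM byte column reads for register forms, the following Python sketch fills the xxx field of a pattern such as 11-011-xxx with the stack index i and splits a ModRM byte back into its three fields. The helper names (modrm_from_pattern, split_modrm, encode) are hypothetical and are not part of this guide. The sketch covers only register-form patterns (those beginning with 11); memory forms (mm-...) depend on the addressing mode and are not handled.

```python
def modrm_from_pattern(pattern, st_index):
    """Build a register-form ModRM byte from a table pattern like '11-011-xxx'.

    The 'xxx' field is replaced by the stack index i of ST(i) (Note 1).
    """
    if not 0 <= st_index <= 7:
        raise ValueError("ST(i) index must be in 0..7")
    mod, reg, rm = pattern.split("-")
    if mod != "11" or rm != "xxx":
        raise ValueError("expected a register-form pattern like '11-011-xxx'")
    # ModRM layout: mod in bits 7-6, reg in bits 5-3, r/m in bits 2-0.
    return (int(mod, 2) << 6) | (int(reg, 2) << 3) | st_index


def split_modrm(byte):
    """Split a ModRM byte into its (mod, reg, rm) fields."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def encode(first_byte, pattern, st_index):
    """Return the two-byte encoding: first opcode byte followed by ModRM."""
    return bytes([first_byte, modrm_from_pattern(pattern, st_index)])


if __name__ == "__main__":
    # FSTP ST(2): first byte DDh, ModRM pattern 11-011-xxx
    print(encode(0xDD, "11-011-xxx", 2).hex())
    # FSUB ST, ST(3): first byte D8h, ModRM pattern 11-100-xxx
    print(encode(0xD8, "11-100-xxx", 3).hex())
    # Low three bits of the ModRM byte give the stack index
    print(split_modrm(0xDA))
```

With the table rows for FSTP ST(i) and FSUB ST, ST(i), the sketch yields the byte pairs DD DA and D8 E3. The last field returned by split_modrm is the stack index that Note 1 describes.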

Instruction Latencies

Appendix C
