25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Chapter 4 Instruction-Decoding Optimizations

The optimizations in this chapter are designed to help maximize the number of instructions that the processor can decode at one time.

The instruction fetcher of both the AMD Athlon™ 64 and AMD Opteron™ processors reads 16-byte packets from the L1 instruction cache. These packets are 16-byte aligned. The instruction bytes are then merged into a 32-byte pick window. On each cycle, the in-order front-end engine selects up to three AMD64 instructions for decode from the pick window.
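The packet alignment described above can be illustrated with a little address arithmetic. The following is a minimal sketch, not part of the guide: the helper names `packet_base` and `packets_spanned` and the example addresses are hypothetical, and only the 16-byte packet size and 32-byte pick window come from the text. It shows which 16-byte-aligned packet holds a given instruction address and how many packets an instruction of a given length touches.

```python
# Sizes taken from the text; everything else here is an illustrative assumption.
PACKET_BYTES = 16        # fetch packet size, 16-byte aligned
PICK_WINDOW_BYTES = 32   # pick window size (two packets' worth of bytes)


def packet_base(addr: int) -> int:
    """Return the start address of the 16-byte-aligned packet containing addr."""
    return addr & ~(PACKET_BYTES - 1)


def packets_spanned(addr: int, length: int) -> int:
    """Count the fetch packets touched by an instruction of `length` bytes at addr."""
    first = packet_base(addr)
    last = packet_base(addr + length - 1)
    return (last - first) // PACKET_BYTES + 1


if __name__ == "__main__":
    # Hypothetical addresses: a 5-byte instruction at two different offsets.
    for addr in (0x401000, 0x40100E):
        print(hex(addr), hex(packet_base(addr)), packets_spanned(addr, 5))
```

In this sketch a 5-byte instruction starting at offset 0 of a packet lies entirely within that packet, while the same instruction starting at offset 14 extends into the next one, so bytes from both packets are needed to cover it.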

This chapter covers the following topics:

Topic                                                             Page

DirectPath Instructions                                           72
Load-Execute Instructions                                         73
Load-Execute Integer Instructions                                 73
Load-Execute Floating-Point Instructions with Floating-Point Operands   74
Load-Execute Floating-Point Instructions with Integer Operands    74
Branch Targets in Program Hot Spots                               76
32/64-Bit vs. 16-Bit Forms of the LEA Instruction                 77
Short Instruction Encodings                                       80
Partial-Register Reads and Writes                                 81
Using LEAVE for Function Epilogues                                83
Alternatives to SHLD Instruction                                  85
8-Bit Sign-Extended Immediate Values                              87
8-Bit Sign-Extended Displacements                                 88
Code Padding with Operand-Size Override and NOP                   89

