25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

can execute out-of-order. In addition, a particular integer pipe can execute two micro-ops from different macro-ops (one in the ALU and one in the AGU) at the same time. See Figure 7 on page 256.

Each of the three ALUs performs general purpose logic functions, arithmetic functions, conditional functions, divide step functions, status flag multiplexing, and branch resolutions. The AGUs calculate the logical addresses for loads, stores, and LEAs. A load and store unit reads and writes data to and from the L1 data cache. The integer scheduler sends a completion status to the ICU when the outstanding micro-ops for a given macro-op are executed.

All integer operations can be handled within any of the three ALUs with the exception of multiplies. Multiplies are handled by a pipelined multiplier that is attached to the pipeline at pipe 0, as shown in Figure 7. Multiplies always issue to integer pipe 0, and the issue logic creates results bus bubbles for the multiplier in integer pipes 0 and 1 by preventing non-multiply micro-ops from issuing at the appropriate time.

A.13 Floating-Point Scheduler

The floating-point logic of the AMD Athlon 64 and AMD Opteron processors is a high-performance, fully pipelined, superscalar, out-of-order execution unit. It is capable of accepting three macro-ops per cycle from any mixture of the following types of instructions:

x87 floating-point

3DNow! technology

MMX technology

SSE

SSE2

The floating-point scheduler handles register renaming and has a dedicated 36-entry scheduler buffer organized as 12 lines of three macro-ops each. It also performs data superforwarding, micro-op issue, and out-of-order execution. The floating-point scheduler communicates with the ICU to retire a macro-op, to manage comparison results from the FCOMI instruction, and to back out results from a branch misprediction.

Superforwarding is a performance optimization. It allows a floating point operation having a dependency on a register to be scheduled sooner when that register is waiting to be filled by a pure load from memory. Instead of waiting for the first instruction to write its load-data to the register and then waiting for the second instruction to read it, the load-data can be provided directly to the dependent instruction, much like regular forwarding between FPU-only operations. The result from the load is said to be "superforwarded" to the floating-point operation. In the following example, the FADD can be scheduled to execute as soon as the load operation fetches its data rather than having to wait and read it out of the register file.

Appendix A Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors

257

Page 273
Image 273
AMD 250 manual Floating-Point Scheduler, 257