Intel 80200, Processor B.2.4 Memory Pipeline, B.2.5 Multiply/Multiply Accumulate (MAC) Pipeline

B-8 March, 2003 Developer’s Manual

Intel® 80200 Processor based on Intel® XScale™ Microarchitecture

Optimization Guide

B.2.4 Memory Pipeline

The memory pipeline consists of two stages, D1 and D2. The data cache unit, or DCU, consists of

the data-cache array, mini-data cache, fill buffers, and write buffers. The memory pipeline handles

load/store instructions.

Operation begins in D1 after the X1 pipestage has calculated the effective address for a load or store.

The data cache and mini-data cache return the destination data in the D2 pipestage. Before data is

returned in the D2 pipestage, sign extension and byte alignment occur for byte and half-word

loads.
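The sign extension and byte alignment described above can be modelled in software. The following is a minimal sketch in C, not a description of the actual D2 hardware: the function names `extract_byte` and `extract_half`, the `is_signed` flag, and the little-endian lane numbering are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: pick one byte lane out of a 32-bit word
   (byte_offset 0 is bits 7:0, an assumed little-endian lane order),
   then sign- or zero-extend it to 32 bits. */
static int32_t extract_byte(uint32_t word, unsigned byte_offset, int is_signed)
{
    assert(byte_offset < 4u);
    /* Byte alignment: shift the selected lane down to bits 7:0. */
    int32_t v = (int32_t)((word >> (8u * byte_offset)) & 0xFFu);
    /* Sign extension: values with bit 7 set become negative. */
    if (is_signed && v >= 0x80)
        v -= 0x100;
    return v;
}

/* Same idea for a half-word; only offsets 0 and 2 are modelled. */
static int32_t extract_half(uint32_t word, unsigned byte_offset, int is_signed)
{
    assert(byte_offset == 0u || byte_offset == 2u);
    int32_t v = (int32_t)((word >> (8u * byte_offset)) & 0xFFFFu);
    if (is_signed && v >= 0x8000)
        v -= 0x10000;
    return v;
}
```

For example, with the word 0x11F2837F, a signed byte load of lane 1 (0x83) yields -125 in this model, while an unsigned load of the same lane yields 131.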

B.2.5 Multiply/Multiply Accumulate (MAC) Pipeline

The Multiply-Accumulate (MAC) unit executes the multiply and multiply-accumulate instructions

supported by the Intel® 80200 processor core. The MAC implements the 40-bit Intel® 80200

processor accumulator register acc0 and handles the instructions that transfer its value to and

from general-purpose ARM registers.
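To make the 40-bit accumulator concrete, the sketch below models it in C with a 64-bit integer. It is an illustration under stated assumptions, not the hardware definition: the names `wrap40`, `mac40`, `acc_lo`, `acc_hi`, and `acc_from_regs` are hypothetical, wrap-around on overflow is assumed, and the split into a low 32-bit register and a sign-extended high byte is an assumed transfer layout that should be checked against the instruction set reference.

```c
#include <assert.h>
#include <stdint.h>

#define ACC_MASK ((UINT64_C(1) << 40) - 1)

/* Reduce a 64-bit value to a signed 40-bit quantity (sign bit is bit 39).
   Wrap-around on overflow is an assumption of this model. */
static int64_t wrap40(int64_t v)
{
    uint64_t u = (uint64_t)v & ACC_MASK;
    int64_t r = (int64_t)u;
    if (u & (UINT64_C(1) << 39))
        r -= INT64_C(1) << 40;
    return r;
}

/* One multiply-accumulate step: acc + a * b, kept to 40 bits. */
static int64_t mac40(int64_t acc, int32_t a, int32_t b)
{
    assert(acc == wrap40(acc)); /* caller must pass a valid 40-bit value */
    return wrap40(acc + (int64_t)a * (int64_t)b);
}

/* Assumed transfer layout: bits 31:0 go to one register... */
static uint32_t acc_lo(int64_t acc)
{
    return (uint32_t)((uint64_t)acc & 0xFFFFFFFFu);
}

/* ...and bits 39:32 go to another, sign-extended to 32 bits. */
static int32_t acc_hi(int64_t acc)
{
    int32_t hi = (int32_t)(((uint64_t)acc >> 32) & 0xFFu);
    return hi >= 0x80 ? hi - 0x100 : hi;
}

/* Rebuild the 40-bit value from the two registers. */
static int64_t acc_from_regs(uint32_t lo, int32_t hi)
{
    uint64_t u = ((uint64_t)((uint32_t)hi & 0xFFu) << 32) | lo;
    return wrap40((int64_t)u);
}
```

In this model, a 40-bit value survives a round trip through `acc_lo`/`acc_hi` and `acc_from_regs` unchanged, which is the property a register transfer needs.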

The following are important characteristics of the MAC:

• The MAC is not truly pipelined, as the processing of a single instruction may require use of the

same datapath resources for several cycles before a new instruction can be accepted. The

instruction type and source arguments determine the number of cycles required.

• No more than two instructions can occupy the MAC pipeline concurrently.

• When the MAC is processing an instruction, another instruction may not enter M1 unless the

original instruction completes in the next cycle.

• The MAC unit can operate on 16-bit packed signed data. This reduces register pressure and

memory traffic. Two 16-bit data items can be loaded into a register with one LDR.

• The MAC can achieve throughput of one multiply per cycle when performing a 16-by-32-bit

multiply.
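The packed 16-bit data point in the list above can be illustrated with a short C sketch. It shows two signed 16-bit items sharing one 32-bit word, so that a single word load brings in both, followed by a dot product over the packed halves. The helper names `pack16`, `half_lo`, `half_hi`, and `dot_packed` are hypothetical, and the code models only the arithmetic, not the instruction selection or timing of the MAC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack two signed 16-bit items into one 32-bit word (lo in bits 15:0). */
static uint32_t pack16(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Extract the bottom half as a signed value. */
static int32_t half_lo(uint32_t w)
{
    int32_t v = (int32_t)(w & 0xFFFFu);
    return v >= 0x8000 ? v - 0x10000 : v;
}

/* Extract the top half as a signed value. */
static int32_t half_hi(uint32_t w)
{
    int32_t v = (int32_t)(w >> 16);
    return v >= 0x8000 ? v - 0x10000 : v;
}

/* Dot product over packed data: each word read supplies two 16-bit items. */
static int64_t dot_packed(const uint32_t *a, const uint32_t *b, size_t n_words)
{
    assert(a != NULL && b != NULL);
    int64_t sum = 0;
    for (size_t i = 0; i < n_words; i++) {
        sum += (int64_t)half_lo(a[i]) * half_lo(b[i]);
        sum += (int64_t)half_hi(a[i]) * half_hi(b[i]);
    }
    return sum;
}
```

Packing halves the number of words read per element, which is the register-pressure and memory-traffic benefit the list describes.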

The execution of the MAC unit starts at the beginning of the M1 pipestage, where it receives two

32-bit source operands. Results are completed N cycles later (where N is dependent on the operand

size) and returned to the register file. For more information on MAC instruction latencies, refer to

Section 14.4, “Instruction Latencies”.

An instruction that occupies the M1 or M2 pipestage also occupies the X1 or X2 pipestage,

respectively. Each cycle, a MAC operation progresses from M1 toward M5. A MAC operation may

complete anywhere from M2 to M5. If a MAC operation enters M3-M5, it is considered committed

because it modifies architectural state regardless of subsequent events.
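The stage rules in this paragraph can be restated as two small predicates. This is only a reading aid written in C: the `mac_stage` enumeration and the function names are hypothetical, and the code encodes nothing beyond what the text states (completion is possible from M2 onward, and commitment begins at M3).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the five MAC pipestages. */
enum mac_stage { M1 = 1, M2, M3, M4, M5 };

/* An operation that has entered M3, M4, or M5 is considered committed. */
static bool mac_is_committed(enum mac_stage s)
{
    assert(s >= M1 && s <= M5);
    return s >= M3;
}

/* An operation may complete in any stage from M2 to M5, but not in M1. */
static bool mac_may_complete_in(enum mac_stage s)
{
    assert(s >= M1 && s <= M5);
    return s >= M2;
}
```

Under this model, M2 is the only stage in which an operation can complete without having been committed first.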