3.2.1.1.1 Branch Acceleration
To maximize the performance of conditional branch instructions, the IFP implements a sophisticated
two-level acceleration mechanism. The first level is an 8-entry, direct-mapped branch cache with 2 bits
per entry indicating one of four prediction states (strongly or weakly; taken or not-taken). The branch cache
also provides the association between instruction addresses and the corresponding target address. In the
event of a branch cache hit, if the branch is predicted as taken, the branch cache sources the target address
from the IC1 stage back into the IAG to redirect the prefetch stream to the new location.
The branch cache implements instruction folding, so conditional branch instructions correctly predicted as
taken can execute in zero cycles. For conditional branches with no information in the branch cache, a
second-level, direct-mapped prediction table is accessed. Each of its 128 entries uses the same 2-bit
prediction mechanism as the branch cache.
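Because both structures use the same 2-bit scheme, the states behave like a saturating counter. The
following C sketch is a minimal model of that prediction and update behavior; the enum names and the
one-step saturating update policy are illustrative assumptions, not taken from the manual.

#include <stdbool.h>

/* Minimal model of the 2-bit prediction state shared by the branch
 * cache and the second-level prediction table (names are assumed). */
typedef enum {
    STRONGLY_NOT_TAKEN = 0,
    WEAKLY_NOT_TAKEN   = 1,
    WEAKLY_TAKEN       = 2,
    STRONGLY_TAKEN     = 3
} pred_state_t;

/* A branch is predicted taken in either of the two "taken" states. */
static bool predict_taken(pred_state_t s)
{
    return s >= WEAKLY_TAKEN;
}

/* After the branch resolves, step one state toward the actual outcome,
 * saturating at the two strong states. */
static pred_state_t update_state(pred_state_t s, bool actually_taken)
{
    if (actually_taken && s < STRONGLY_TAKEN)
        return (pred_state_t)(s + 1);
    if (!actually_taken && s > STRONGLY_NOT_TAKEN)
        return (pred_state_t)(s - 1);
    return s;
}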
If a branch is predicted as taken, branch acceleration logic in the IED stage generates the target address.
Other change-of-flow instructions, including unconditional branches, jumps, and subroutine calls, use a
similar mechanism where the IFP calculates the target address. The performance of the subroutine return
instruction (RTS) is improved through the use of a four-entry, LIFO hardware return stack. In all cases,
these mechanisms allow the IFP to redirect the fetch stream down the predicted path well ahead of
instruction execution.
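To illustrate the return-address mechanism, the sketch below models a four-entry LIFO return stack in C.
The manual specifies only the depth and the LIFO ordering; the structure layout, field names, and the
wrap-around behavior on overflow are assumptions.

#include <stdint.h>

#define RETURN_STACK_DEPTH 4

typedef struct {
    uint32_t entry[RETURN_STACK_DEPTH];
    int      top;    /* index of the most recently pushed address */
    int      count;  /* number of valid entries                   */
} return_stack_t;

/* On a subroutine call (e.g., BSR or JSR), push the return address. */
static void rs_push(return_stack_t *rs, uint32_t return_addr)
{
    rs->top = (rs->top + 1) % RETURN_STACK_DEPTH;
    rs->entry[rs->top] = return_addr;
    if (rs->count < RETURN_STACK_DEPTH)
        rs->count++;
}

/* On an RTS, pop the predicted return address so the fetch stream can
 * be redirected before the real return address is available. Returns
 * 0 when the stack holds no prediction. */
static int rs_pop(return_stack_t *rs, uint32_t *predicted_addr)
{
    if (rs->count == 0)
        return 0;
    *predicted_addr = rs->entry[rs->top];
    rs->top = (rs->top + RETURN_STACK_DEPTH - 1) % RETURN_STACK_DEPTH;
    rs->count--;
    return 1;
}

A stack would start out empty, for example: return_stack_t rs = { .top = -1, .count = 0 };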

3.2.1.2 Operand Execution Pipeline (OEP)

The two instruction registers in the decode stage (DS) of the OEP are loaded from the FIFO instruction
buffer or are bypassed directly from the instruction early decode (IED). The OEP consists of two
traditional, two-stage RISC compute engines with a dual-ported register file access feeding an arithmetic
logic unit (ALU).
The compute engine at the top of the OEP (the address ALU) is typically used for operand address
calculations; the execution ALU at the bottom is used for instruction execution. The resulting structure
provides 4 Gbytes/s operand bandwidth (at 162 MHz) to the two compute engines and supports
single-cycle execution speeds for most instructions, including all load and store operations and most
embedded-load operations. The V4 OEP supports the ColdFire Revision B instruction set, which adds a
few new instructions to improve performance and code density.
The OEP also implements the following advanced performance features:
• Stalls are minimized by dynamically basing the choice between the address ALU and the execution
ALU for instruction execution on the pipeline state (a behavioral sketch follows this list).
• The address ALU and register renaming resources together can execute heavily used opcodes and
forward results to subsequent instructions with no pipeline stalls.
• Instruction folding involving MOVE instructions allows two instructions to be issued in one cycle.
The resulting microarchitecture approaches full superscalar performance at a much lower silicon
cost.
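The first feature above amounts to a simple issue decision. The sketch below is a rough behavioral
illustration of one way that choice could depend on pipeline state; the selection criteria and names are
assumptions for illustration only, not the actual OEP issue logic.

#include <stdbool.h>

typedef enum { USE_ADDRESS_ALU, USE_EXECUTION_ALU } alu_choice_t;

/* Assumed decision: an instruction that needs an operand address
 * calculation uses the address ALU for that calculation and the
 * execution ALU for the operation itself; a register-to-register
 * instruction may execute early in the address ALU when the execution
 * ALU is occupied, avoiding a stall. */
static alu_choice_t choose_alu(bool needs_address_calc, bool execution_alu_busy)
{
    if (needs_address_calc)
        return USE_EXECUTION_ALU;

    return execution_alu_busy ? USE_ADDRESS_ALU : USE_EXECUTION_ALU;
}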
3.2.1.2.1 Illegal Opcode Handling
To aid in conversion from M68000 code, every 16-bit operation word is decoded to ensure that each
instruction is valid. If the processor attempts execution of an illegal or unsupported instruction, an illegal
instruction exception (vector 4) is taken.
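For reference, vector 4 resides at offset 0x010 (4 × 4 bytes) in the exception vector table. The sketch
below shows one way software might install a handler for it; the vector_table and
illegal_instruction_handler names are illustrative and assume a RAM-resident table addressed through
the VBR.

#include <stdint.h>

#define ILLEGAL_INSTRUCTION_VECTOR  4   /* vector 4, offset 0x010 */

/* Assumed RAM-resident vector table and handler entry point; both
 * names are illustrative, not defined by the manual. */
extern uint32_t vector_table[256];
extern void illegal_instruction_handler(void);

static void install_illegal_instruction_handler(void)
{
    vector_table[ILLEGAL_INSTRUCTION_VECTOR] =
        (uint32_t)illegal_instruction_handler;
}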
3.2.1.2.2 Enhanced Multiply/Accumulate (EMAC) Unit
The EMAC unit in the Version 4e core provides hardware support for a limited set of digital signal processing
(DSP) operations used in embedded code, while supporting the integer multiply instructions in the