Instruction Timing

5.3.2.2 Delayed COF

When a change-of-flow instruction is executed, the core must wait for the pipeline to fill, starting with a new pre-fetch from memory. A delay slot is the next VLES after a delayed change-of-flow instruction. Because the delay slots of a change-of-flow operation can be used to continue executing previously fetched instructions, special delayed instructions are included in the instruction set. These instructions use part or all of the delay cycles to execute one additional execution set, which reduces the penalty of a change-of-flow operation. If the additional execution set in the delay slot is included in the cycle count, the number of cycles for the change-of-flow instruction is effectively reduced. Refer to Section 5.3.2, “Change-Of-Flow Instruction Timing,” on page 5-17 for further details.

5.3.2.3 COF Execution Cycles

The basic change-of-flow JMP instruction takes three cycles to execute. However, the number of cycles is different for the following change-of-flow instructions:

PC-relative instructions such as BRA require an additional cycle to calculate the destination.

Delayed instructions such as JMPD effectively require the same cycle count as the non-delayed version (in this example, JMP) minus the execution cycle count of the set in the delay slot, because the pipeline fill-up time is used to execute a useful execution set. The actual time taken to jump to the new address is the same for the delayed and non-delayed versions. However, the effective cycle count is lower for the delayed version, since the instructions in the delay slot would cost additional cycles if the non-delayed version were used.

The delay slot lasts for the full execution time of the set in the delay slot, which may be more than one cycle. The minimum execution time of a delayed instruction is one cycle. For example:

JMPD dest             ; takes 1 cycle (3-2=1), because the next instruction
MOVE.W d0,(sp + xxx)  ; takes 2 cycles

Stalls that originate in delay-slot instructions and are caused by a memory access wait state or contention stall the whole core and are not deducted from the cycle count.
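
The delayed-instruction arithmetic described above can be sketched as a small calculation. This Python sketch is illustrative only: the function name effective_delayed_cycles and its parameters are hypothetical and not part of any SC140 tool. It assumes the cycle count of the non-delayed version is known (three for JMP, per the text) and that stall cycles are excluded from the delay-slot count.

```python
def effective_delayed_cycles(base_cycles: int, delay_slot_cycles: int) -> int:
    """Effective cycle count of a delayed change-of-flow instruction.

    base_cycles: cycles of the non-delayed version (3 for JMP, per the text).
    delay_slot_cycles: execution cycles of the set in the delay slot,
        excluding stalls (stalls are not deducted from the count).
    """
    # The delay-slot set executes during pipeline fill, so its cycles are
    # deducted; a delayed instruction never costs less than one cycle.
    return max(1, base_cycles - delay_slot_cycles)


JMP_CYCLES = 3  # basic JMP, from the text

# JMPD followed by the 2-cycle MOVE.W from the example in the text
print(effective_delayed_cycles(JMP_CYCLES, 2))
# JMPD followed by a hypothetical 1-cycle execution set
print(effective_delayed_cycles(JMP_CYCLES, 1))
```

The max() call reflects the stated one-cycle minimum: a delay-slot set longer than the base count does not drive the effective cost to zero or below.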

Conditional change-of-flow instructions (JT/JF/BT/BF) require four cycles to execute if taken and one cycle if not taken.
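
Because the taken and not-taken costs differ, the average cost of a conditional change-of-flow depends on how often it is taken. The sketch below is a back-of-the-envelope estimate, not a figure from the manual: the function name expected_conditional_cycles and the taken probability p_taken are assumptions introduced for illustration, and stalls and the other adjustments in this section are ignored.

```python
COND_TAKEN_CYCLES = 4      # JT/JF/BT/BF when taken, from the text
COND_NOT_TAKEN_CYCLES = 1  # JT/JF/BT/BF when not taken, from the text


def expected_conditional_cycles(p_taken: float) -> float:
    """Average cycles per conditional change-of-flow for a given taken rate."""
    if not 0.0 <= p_taken <= 1.0:
        raise ValueError("p_taken must be between 0 and 1")
    # Weighted average of the two documented costs.
    return p_taken * COND_TAKEN_CYCLES + (1.0 - p_taken) * COND_NOT_TAKEN_CYCLES


# A branch assumed to be taken one time in four
print(expected_conditional_cycles(0.25))
```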

The core implements a mechanism for fast return from subroutine. The return address of a subroutine is kept in a hidden return address stack (RAS) register in addition to being pushed to the stack, which avoids reading it from the stack in memory upon return. However, this hidden register is not valid if there was another jump to a subroutine before the return. In that case, the core adds two cycles to the RTS instruction to read the return address from the stack. Refer to Section 5.5.5, “Fast Return from Subroutines,” for a more detailed description of the fast return mechanism.
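
The effect of nesting on the RTS cost can be illustrated with a toy model. This is a sketch under stated assumptions, not the hardware implementation described in Section 5.5.5: the class ReturnAddressModel and its methods are hypothetical, the hidden register is modeled as a single validity flag, and it is assumed that the register no longer holds a usable address once an RTS has consumed it.

```python
from typing import List, Tuple


class ReturnAddressModel:
    """Toy model of the fast-return behaviour described in the text."""

    RTS_PENALTY_CYCLES = 2  # extra cycles to read the return address from memory

    def __init__(self) -> None:
        self.stack: List[int] = []  # return addresses pushed to the stack in memory
        self.ras_valid = False      # does the hidden register hold the next return address?

    def call(self, return_address: int) -> None:
        # The return address is pushed to the stack and also kept in the
        # hidden register, replacing whatever the register held before.
        self.stack.append(return_address)
        self.ras_valid = True

    def ret(self) -> Tuple[int, int]:
        """Model an RTS; returns (return_address, extra_cycles)."""
        address = self.stack.pop()
        extra = 0 if self.ras_valid else self.RTS_PENALTY_CYCLES
        # Assumption: after this return the register does not hold the
        # address needed by the enclosing subroutine.
        self.ras_valid = False
        return address, extra


model = ReturnAddressModel()
model.call(0x100)        # outer subroutine call
model.call(0x200)        # nested call before the outer return
inner = model.ret()      # inner RTS: register still valid
outer = model.ret()      # outer RTS: address must be read from the stack
print(inner, outer)
```

In this model the inner return incurs no extra cycles, while the outer return pays the two-cycle penalty because another subroutine call occurred before it.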

The core keeps a “shadow” version of SP-8 to save pre-calculation time in case of a POP. If SP was explicitly changed by a TFRA or an AGU arithmetic instruction, the shadow SP is not valid and another cycle is needed for the first POP pre-calculation (or equivalent, such as RTE). Refer to Section 5.5.4, “Shadow Stack Pointer Registers,” for a more detailed description of the shadow SP mechanism.
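
The shadow SP rule can be sketched in the same way. The model below is hypothetical: the class ShadowSpModel and its method names are introduced for illustration, the shadow value is assumed to be valid initially, and it is assumed that the first POP after an explicit SP change recomputes the shadow value so that later POPs incur no extra cycle. Section 5.5.4 remains the reference for the actual mechanism.

```python
class ShadowSpModel:
    """Toy model of the extra POP cycle after an explicit SP change."""

    def __init__(self) -> None:
        self.shadow_valid = True  # assumption: shadow SP starts out valid

    def explicit_sp_change(self) -> None:
        # SP written by TFRA or an AGU arithmetic instruction.
        self.shadow_valid = False

    def pop_extra_cycles(self) -> int:
        """Extra cycles for a POP (or equivalent, such as RTE)."""
        extra = 0 if self.shadow_valid else 1
        # Assumption: the pre-calculation done here makes the shadow valid again.
        self.shadow_valid = True
        return extra


sp = ShadowSpModel()
sp.explicit_sp_change()
first = sp.pop_extra_cycles()   # first POP after the change
second = sp.pop_extra_cycles()  # subsequent POP
print(first, second)
```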

A change-of-flow instruction (jump, branch, interrupt, or long loop iteration) whose destination execution set is spread over two fetch sets requires an additional cycle for memory access. An execution set is not necessarily aligned to a fetch set and can overlap two fetch sets. The core keeps two fetch sets in a buffer, so this is not normally a problem. However, when a

SC140 DSP Core Reference Manual
