User’s Manual

IBM PowerPC 750GX and 750GL RISC Microprocessor

Performance features such as branch folding, BTIC, dynamic branch prediction (implemented in the BHT), 2-level branch prediction, and the implementation of nonblocking caches minimize the penalties associated with flow-control operations on the 750GX. The timing for branch instruction execution is determined by many factors including:

Whether the branch is taken

Whether instructions in the target stream, typically the first two instructions in the target stream, are in the branch target instruction cache (BTIC)

Whether the target instruction stream is in the L1 cache

Whether the branch is predicted

Whether the prediction is correct

6.4.1.1 Branch Folding

When the fetcher encounters a branch instruction, the BPU immediately begins to decode it and tries to resolve it. Branch folding is the removal of branches from the instruction stream, and it is independent of whether the branch is taken or not taken. However, if the branch instruction updates either the LR or the CTR, it cannot be removed and must be allocated a position in the completion queue. If a branch cannot be resolved immediately, it is predicted, and instruction fetching resumes along the predicted path. Those instructions are conditionally fed into the instruction queue. Later, if the prediction is resolved as correct, the fetched instructions are validated and allowed to complete and be retired. If the prediction is resolved as incorrect, the fetched instructions are invalidated, and instruction fetching resumes along the other path of the branch.
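The folding rule and the handling of a predicted branch can be sketched in a few lines of Python. This is only an illustrative model of the behavior described above, not a description of the hardware logic; the names Branch, can_fold, and resolve_prediction are hypothetical, and the instruction lists are plain strings standing in for fetched instructions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Branch:
    """Hypothetical record for one branch instruction."""
    mnemonic: str
    updates_lr: bool = False   # for example, bl writes the link register
    updates_ctr: bool = False  # for example, bdnz decrements the count register


def can_fold(branch: Branch) -> bool:
    """A branch can be removed from the instruction stream only if it
    updates neither the LR nor the CTR; otherwise it needs a position
    in the completion queue."""
    return not (branch.updates_lr or branch.updates_ctr)


def resolve_prediction(predicted_taken: bool, actually_taken: bool,
                       speculative: List[str],
                       other_path: List[str]) -> List[str]:
    """Return the instructions that are allowed to proceed.

    Correct prediction: the speculatively fetched instructions are validated.
    Incorrect prediction: they are invalidated and fetching resumes along
    the other path of the branch.
    """
    if predicted_taken == actually_taken:
        return list(speculative)
    return list(other_path)


if __name__ == "__main__":
    print(can_fold(Branch("b")))                     # folded
    print(can_fold(Branch("bl", updates_lr=True)))   # needs a completion-queue slot
    print(resolve_prediction(True, False, ["and1", "and2"], ["add3", "add4"]))
```

The model deliberately leaves out timing: it shows only which branches are eligible for folding and which instruction stream survives once the prediction is resolved.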

Figure 6-7 on page 227 shows branch folding. Here a b instruction is encountered in a series of add instructions. The branch is resolved as taken. What happens on the next clock cycle depends on whether the target instruction stream is in the BTIC or the L1 instruction cache, or whether it must be fetched from the L2 cache or from system memory.

Figure 6-7 shows cases where there is a BTIC hit, and where there is a BTIC miss (and instruction-cache hit). If there is a BTIC hit, then on the next clock cycle the bx instruction is replaced by the target instruction, and1, which was found in the BTIC. The second and instruction is also fetched from the BTIC. On the following clock cycle, the next four and instructions from the target stream are fetched from the instruction cache.

If the target instruction is not in the BTIC, there is an idle cycle while the fetcher attempts to fetch the first four instructions from the instruction cache (on the next clock cycle). In the example in Figure 6-7, the first four target instructions are fetched on the next clock.

If the target instruction misses in both the BTIC and the L1 instruction cache, an L2 cache or memory access is required. The latency of this access depends on several factors, such as the processor/bus clock ratio. In most cases, new instructions arrive in the IQ before the execution units become idle.
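The three cases above (BTIC hit, BTIC miss with an instruction-cache hit, and a miss in both) can be summarized as a small decision function. The following Python sketch is illustrative only; TargetSource and classify_target_fetch are hypothetical names, and no cycle counts beyond those stated in the text are modeled. It assumes that a BTIC hit takes priority regardless of the L1 lookup result.

```python
from enum import Enum


class TargetSource(Enum):
    """Where the first instructions of a taken branch's target come from."""
    BTIC = "target instructions supplied from the BTIC on the next clock cycle"
    ICACHE = "idle cycle, then up to four instructions from the instruction cache"
    L2_OR_MEMORY = "L2 cache or memory access; latency depends on clock ratios"


def classify_target_fetch(btic_hit: bool, l1_hit: bool) -> TargetSource:
    """Classify the fetch path for a taken branch.

    Assumption: a BTIC hit is checked first, so the L1 result matters
    only when the BTIC misses.
    """
    if btic_hit:
        return TargetSource.BTIC
    if l1_hit:
        return TargetSource.ICACHE
    return TargetSource.L2_OR_MEMORY


if __name__ == "__main__":
    for btic, l1 in [(True, True), (False, True), (False, False)]:
        source = classify_target_fetch(btic, l1)
        print(f"BTIC hit={btic}, L1 hit={l1}: {source.name} ({source.value})")
```

A timing estimate built on this classification would still need the actual L2 and bus latencies for the system in question, which the sketch intentionally omits.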

Instruction Timing


Page 226 of 377

March 27, 2006