Compiler Support on StarCore
Good To Know
•The use of four variables removes the accumulation dependency that is required for parallelism.
•Bit exact considerations must be understood if this technique is used: overflow/saturation characteristics may change during split summation.
6 Multi-Sample Exercise
The multi-sample exercise demonstrates the multisample technique. As the exercise in Section 5 shows, the split summation technique allows a sum of products operation to be calculated using all four ALUs by evaluating four intermediate products at a time. However, it does not guarantee bit-exact agreement with serially accumulating each intermediate product using a single ALU. To ensure bit-exactness, the order of summation must be preserved by performing each intermediate product/accumulation in turn.Therefore, the intermediate products cannot be evaluated in parallel. Furthermore, the split summation technique may not be suited for the application. Other techniques can be used where it is possible to evaluate one intermediate product from each of four output sample calculations in parallel. Consider the FIR filtering operation described by Equation 4:
N – 1 | | |
y(n) = ∑ ai x(n – i) , | for 0 ≤ n < L | (4) |
i = 0
A C code implementation of this operation typically resembles the implementation of Ex6.c. To use all four ALUs, the operations can be grouped as illustrated in the following equation:
y(n) = a0x(n) y(n + 1) = a0x(n + 1) y(n + 2) = a0x(n + 2) y(n + 3) = a0x(n + 3)
+a1x(n – 1)
+a1x(n)
+a1x(n + 1)
+a1x(n + 2)
+a2x(n – 2) + a3x(n – 3) + … + aN – 2x(n – N + 2) + aN – 1x(n – N + 1)
| | | | |
| + a2x(n – 1) + a3x(n – 2) + … + aN – 2x(n – N + 3) + aN – 1x(n – N + 2) | (5) |
| + a2x(n) | + a3x(n – 1) + … + aN – 2x(n – N + 4) + aN – 1x(n – N + 3) |
| |
| + a2x(n + 1) + a3x(n) | + … + aN – 2x(n – N + 5) + aN – 1x(n – N + 4) | |
Group 0 Group 1 Group 2 Group 3 | Group N-2 | Group N-1 |
In Equation 5, the products and accumulations within each group are calculated in parallel, but the groups themselves are evaluated in sequence, thus preserving the order of accumulation, which in turn preserves the bit-exactness of Equation 4. Therefore, parallelization is achieved by processing multiple samples in parallel rather than multiple intermediate products belonging to only one output sample. When one group (for example, Group 2) is evaluated, only two words of data need to be loaded for the next group (Group 3): a3 and x(n-3). The other values needed for the calculations in Group 3—x(n-2), x(n-1), and x(n)—should already exist in the DSP registers from the calculation of Group 2. The result is a reduction in memory bandwidth requirements that increases code efficiency.
18 | Introduction to the SC140 Tools |