Further Speed Optimization

Compiler Support on StarCore

C Code

Generated Assembly Code

The register-to-register transfers can be eliminated by expanding the inner loop so that each group of four MAC instructions uses the data registers already containing the required data values. This yields faster code, but code size is greater.

9. Save Ex6_1.c as Ex6_2.c.

10. In Ex6_2.c, “unroll” the inner loop instructions four times so that the first four groups (Group 0, Group 1, Group 2, and Group 3) are all processed in the loop. This loop expansion avoids transferring data between registers. You must reduce the number of loop iterations by a factor of four to compensate for the fact that the loop is unrolled by a factor of four.
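The unrolling in step 10 can be sketched in portable C. This is a minimal illustration, not the Ex6_2.c source: the function names, the variables a, b, c, d, and s0 to s3, and the buffer layout are all assumptions, and plain int arithmetic stands in for the SC140 fractional MAC operations. The input buffer x is assumed to hold nout + ntaps - 1 samples, and both nout and ntaps are assumed to be multiples of four.

```c
/* Illustrative sketch only; names and buffer layout are hypothetical.
   y[n] = sum over k of h[k] * x[n + ntaps - 1 - k]. */

/* Reference version: one output sample per pass of the inner loop. */
void fir_ref(const int *x, const int *h, int *y, int nout, int ntaps)
{
    for (int n = 0; n < nout; n++) {
        int sum = 0;
        for (int k = 0; k < ntaps; k++)
            sum += h[k] * x[n + ntaps - 1 - k];
        y[n] = sum;
    }
}

/* Multi-sample version: four outputs per pass, inner loop unrolled by four.
   The variables a, b, c, d play the role of data registers. Each group
   reuses three of them in rotated positions and reloads the fourth, so no
   value is ever copied from one variable to another. */
void fir_multi_unrolled(const int *x, const int *h, int *y, int nout, int ntaps)
{
    for (int n = 0; n < nout; n += 4) {
        const int *px = x + n + ntaps - 1;   /* newest sample used by y[n] */
        int s0 = 0, s1 = 0, s2 = 0, s3 = 0;  /* four accumulators */
        int a = px[3], b = px[2], c = px[1], d = px[0];

        for (int k = 0; k < ntaps; k += 4) { /* a quarter of the iterations */
            /* Group 0: coefficient h[k] */
            s3 += h[k] * a;  s2 += h[k] * b;  s1 += h[k] * c;  s0 += h[k] * d;
            a = px[-k - 1];                  /* a is free, load next sample */

            /* Group 1: coefficient h[k + 1] */
            s3 += h[k + 1] * b;  s2 += h[k + 1] * c;
            s1 += h[k + 1] * d;  s0 += h[k + 1] * a;
            b = px[-k - 2];

            /* Group 2: coefficient h[k + 2] */
            s3 += h[k + 2] * c;  s2 += h[k + 2] * d;
            s1 += h[k + 2] * a;  s0 += h[k + 2] * b;
            c = px[-k - 3];

            /* Group 3: coefficient h[k + 3] */
            s3 += h[k + 3] * d;  s2 += h[k + 3] * a;
            s1 += h[k + 3] * b;  s0 += h[k + 3] * c;
            if (k + 4 < ntaps)               /* skip the read past the buffer start */
                d = px[-k - 4];
        }
        y[n] = s0;  y[n + 1] = s1;  y[n + 2] = s2;  y[n + 3] = s3;
    }
}
```

Each group in the sketch performs four multiply-accumulates, one coefficient read, and one sample read, which matches the shape of the groups described in the exercise. Comparing the output of fir_multi_unrolled with fir_ref on the same data is a quick way to confirm that the unrolled loop still produces the correct result.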

If your inner loop consumes just four cycles, and your code still produces the correct output, congratulations. You have completed Exercise 6.

Notice that each group of four MAC operations and two data load operations now requires just one processor cycle, which is half the time required by the filtering operation and a quarter of the time required by a single-ALU DSP device. However, the code size of the inner loop has increased significantly (to approximately four times that of the second implementation), and this increase must be weighed against the improvement in cycle count. Table 3 compares the main characteristics of the single-sample and multi-sample techniques.

Table 3. Inner Loop Characteristics of Multi-sample and Single-sample Techniques.

Characteristic	Single-sample Algorithm	Multi-sample Algorithm
Cycle count	N	N/4
Registers used	Fewer	More
Sample delay	1	4
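The cycle counts in Table 3 can be turned into a rough estimate for a whole block of output samples. The sketch below assumes that N is the number of filter taps, that the table entries are inner-loop cycles per output sample, and that loop setup and other overhead are ignored. The function names are hypothetical.

```c
/* Idealized inner-loop cycle estimates based on Table 3.
   Assumptions: ntaps is N, nout is a multiple of four,
   and loop overhead is not counted. */

/* Single-sample algorithm: N cycles for every output sample. */
unsigned long single_sample_cycles(unsigned long nout, unsigned long ntaps)
{
    return nout * ntaps;
}

/* Multi-sample algorithm: N cycles for every four output samples,
   that is, N/4 cycles per output sample. */
unsigned long multi_sample_cycles(unsigned long nout, unsigned long ntaps)
{
    return (nout / 4) * ntaps;
}
```

For example, with 64 output samples and a 32-tap filter these estimates give 2048 cycles for the single-sample algorithm and 512 cycles for the multi-sample algorithm.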
