The TigerSHARC DSP uses a Static Superscalar™† architecture. This architecture is superscalar in that the ADSP-TS201S processor’s core can simultaneously execute from one to four 32-bit instructions encoded in a very long instruction word (VLIW) instruction line using the DSP’s dual compute blocks. Because the DSP does not reorder instructions at runtime (the programmer selects which operations will execute in parallel prior to runtime), the order of instructions is static.
With few exceptions, an instruction line, whether it contains one, two, three, or four 32-bit instructions, executes with a throughput of one cycle in a 10-deep processor pipeline.
For optimal DSP program execution, programmers must follow the DSP’s set of instruction parallelism rules when encoding an instruction line. In general, the selection of instructions that the DSP can execute in parallel each cycle depends on the instruction line resources each instruction requires and on the source and destination registers used in the instructions. The programmer has direct control of three core components: the IALUs, the compute blocks, and the program sequencer.
The ADSP-TS201S processor, in most cases, has a two-cycle execution pipeline that is fully interlocked, so whenever a computation result is unavailable for another operation dependent on it, the DSP automatically inserts one or more stall cycles as needed. Efficient programming with dependency-free instructions can eliminate most stalls caused by computational and memory transfer data dependencies.
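The effect of interlocking on cycle count can be illustrated with a toy model. This is only a sketch: the fixed two-cycle result latency, the register names, and the `count_cycles` helper are assumptions made for illustration and do not reproduce the processor's actual pipeline timing.

```python
# Toy interlock model. NOT the ADSP-TS201S's real timing.
# Assumption: a result becomes readable RESULT_LATENCY cycles after its
# instruction line issues.
RESULT_LATENCY = 2

def count_cycles(lines):
    """Count issue cycles, including stalls, for a list of instruction lines.

    Each line is a (dest_regs, src_regs) pair of register-name sequences.
    """
    ready_at = {}  # register name -> first cycle its value may be read
    cycle = 0
    for dests, srcs in lines:
        # Stall until every source operand of this line is available.
        issue = max([cycle] + [ready_at.get(reg, 0) for reg in srcs])
        for reg in dests:
            ready_at[reg] = issue + RESULT_LATENCY
        cycle = issue + 1
    return cycle

# Hypothetical register names, used only for this example.
producer = (["r0"], ["r1", "r2"])
consumer = (["r3"], ["r0", "r4"])
unrelated = (["r5"], ["r6", "r7"])

back_to_back = count_cycles([producer, consumer])            # one stall cycle
interleaved = count_cycles([producer, unrelated, consumer])  # stall slot filled
```

In this model the two-line dependent sequence and the three-line interleaved sequence take the same number of cycles, which is the sense in which dependency-free instructions hide a stall.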
In addition, the ADSP-TS201S processor supports SIMD operations in two ways: SIMD compute blocks and SIMD computations. The programmer can load both compute blocks with the same data (broadcast distribution) or different data (merged distribution).
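The two distribution modes can be sketched as follows. The function names and the way merged data is divided (first half to X, second half to Y) are assumptions for illustration; the text above states only that merged distribution gives the two blocks different data.

```python
def broadcast_load(memory, addr, count):
    """Broadcast distribution: both compute blocks receive the same words."""
    data = list(memory[addr:addr + count])
    return {"X": data, "Y": list(data)}

def merged_load(memory, addr, count):
    """Merged distribution: each compute block receives different words.

    Assumption: the first half of the words goes to X, the second half to Y.
    """
    if count % 2:
        raise ValueError("merged load needs an even word count")
    half = count // 2
    data = list(memory[addr:addr + count])
    return {"X": data[:half], "Y": data[half:]}

memory = list(range(100, 108))            # stand-in for data memory
same = broadcast_load(memory, 0, 2)       # X and Y both hold [100, 101]
split = merged_load(memory, 0, 4)         # X holds [100, 101], Y holds [102, 103]
```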
DUAL COMPUTE BLOCKS
The ADSP-TS201S processor has compute blocks that can execute computations either independently or together as a single-instruction, multiple-data (SIMD) engine. The DSP can issue up to two compute instructions per compute block each cycle, instructing the ALU, multiplier, shifter, or CLU to perform independent, simultaneous operations. Each compute block can execute eight 8-bit, four 16-bit, two 32-bit, or one 64-bit SIMD computations in parallel with the operation in the other block. These computation units support IEEE 32-bit single-precision floating-point, extended-precision 40-bit floating-point, and 8-, 16-, 32-, and 64-bit fixed-point processing.
The compute blocks are referred to as X and Y in assembly syntax, and each block contains four computational units (an ALU, a multiplier, a 64-bit shifter, and a 128-bit CLU) and a 32-word register file.
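The lane counts above follow from dividing a 64-bit operand into equal-width lanes. A minimal sketch of a packed add is shown below; the `packed_add` helper and its wraparound (non-saturating) behavior are assumptions chosen for illustration, not a description of a specific ALU instruction.

```python
def lane_count(lane_bits):
    """Number of lanes of the given width in a 64-bit operand."""
    return 64 // lane_bits

def packed_add(a, b, lane_bits):
    """Add two 64-bit words lane by lane.

    Assumption: each lane wraps modulo 2**lane_bits (no saturation), and no
    carry crosses a lane boundary.
    """
    if 64 % lane_bits:
        raise ValueError("lane width must divide 64")
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 64, lane_bits):
        lane = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane & mask) << shift
    return result

a = 0x0001_0002_0003_FFFF
b = 0x0001_0001_0001_0001
four_lanes = packed_add(a, b, 16)  # lowest lane wraps to 0x0000
one_lane = packed_add(a, b, 64)    # the carry propagates across the word
```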
•Register File: each compute block has a multiported 32-word, fully orthogonal register file used for transferring data between the computation units and data buses and for storing intermediate results. Instructions can access the registers in the register file individually (word-aligned), in sets of two (dual-aligned), or in sets of four (quad-aligned).
•ALU: the ALU performs a standard set of arithmetic operations in both fixed- and floating-point formats. It also performs logic operations.
•Multiplier: the multiplier performs both fixed- and floating-point multiplication and fixed-point multiply and accumulate.
•Shifter: the 64-bit shifter performs logical and arithmetic shifts, bit and bit stream manipulation, and field deposit and extraction operations.
•Communications Logic Unit (CLU): this 128-bit unit provides trellis decoding (for example, Viterbi and Turbo decoders) and executes complex correlations for CDMA communication applications (for example, chip-rate and symbol-rate functions).
†Static Superscalar is a trademark of Analog Devices, Inc.
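The register-file access modes in the list above can be expressed as a simple alignment check. This sketch assumes that "dual-aligned" and "quad-aligned" mean the first register index is a multiple of two or four respectively; the `is_aligned` helper is hypothetical.

```python
REGISTER_FILE_WORDS = 32  # per compute block, as stated above

def is_aligned(first_reg, width):
    """Check whether a register set is a valid word, dual, or quad access.

    width is 1 (word), 2 (dual), or 4 (quad).
    Assumption: a set is aligned when first_reg is a multiple of width.
    """
    if width not in (1, 2, 4):
        raise ValueError("width must be 1, 2, or 4")
    in_range = 0 <= first_reg and first_reg + width <= REGISTER_FILE_WORDS
    return in_range and first_reg % width == 0

quad_ok = is_aligned(0, 4)    # registers 0 to 3
quad_bad = is_aligned(2, 4)   # registers 2 to 5 straddle a quad boundary
```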
Using these features, the compute blocks can:
•Provide 8 MACS per cycle peak and 7.1 MACS per cycle sustained 16-bit performance and provide 2 MACS per cycle peak and 1.8 MACS per cycle sustained 32-bit performance (based on FIR)
•Execute six single-precision floating-point or 24 fixed-point (16-bit) operations per cycle, providing 3.6 GFLOPS or 14.4 billion regular operations per second at 600 MHz
•Perform two complex 16-bit MACS per cycle
•Execute eight trellis butterflies in one cycle
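The throughput figures in the list above follow from multiplying operations per cycle by the clock rate, and the sustained figures can be read as a fraction of peak. The short calculation below reproduces that arithmetic; the helper names are illustrative only.

```python
CLOCK_HZ = 600_000_000  # 600 MHz, as stated above

def peak_rate(ops_per_cycle, clock_hz=CLOCK_HZ):
    """Peak operations per second for a given per-cycle operation count."""
    return ops_per_cycle * clock_hz

def sustained_fraction(sustained, peak):
    """Sustained throughput as a fraction of peak throughput."""
    return sustained / peak

flops = peak_rate(6)                        # 3.6e9 floating-point ops per second
fixed_ops = peak_rate(24)                   # 14.4e9 16-bit ops per second
fir_16bit = sustained_fraction(7.1, 8)      # 16-bit FIR, sustained versus peak
fir_32bit = sustained_fraction(1.8, 2)      # 32-bit FIR, sustained versus peak
```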
DATA ALIGNMENT BUFFER (DAB)
The DAB is a quad-word FIFO that enables loading of quad-word data from nonaligned addresses. Normally, load instructions must be aligned to their data size so that quad words are loaded from a quad-aligned address. Using the DAB significantly improves the efficiency of some applications, such as FIR filters.
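The data selection involved in a nonaligned quad-word load can be sketched by assembling the result from the two aligned quad words that surround it. This is a conceptual model only: the function names are hypothetical, and the way the real DAB buffers and sequences its memory accesses is not modeled.

```python
def aligned_quad(memory, addr):
    """Load four words from a quad-aligned word address."""
    if addr % 4:
        raise ValueError("address is not quad-aligned")
    return list(memory[addr:addr + 4])

def unaligned_quad(memory, addr):
    """Load four words from any word address using only aligned quad loads."""
    offset = addr % 4
    base = addr - offset
    if offset == 0:
        return aligned_quad(memory, base)
    # The requested words span two neighboring aligned quad words.
    window = aligned_quad(memory, base) + aligned_quad(memory, base + 4)
    return window[offset:offset + 4]

memory = list(range(16))            # stand-in for data memory
taps = unaligned_quad(memory, 5)    # words 5 through 8
```

A FIR filter that slides its window one sample at a time reads from addresses like these, which is why alignment-free quad loads help that kind of loop.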
DUAL INTEGER ALU (IALU)
The ADSP-TS201S processor has two IALUs that provide powerful address generation capabilities and perform many general-purpose integer operations. The IALUs are referred to as J and K in assembly syntax and have the following features:
•Provide memory addresses for data and update pointers
•Support circular buffering and bit-reverse addressing
•Perform general-purpose integer operations, increasing programming flexibility
•Include a 31-word register file for each IALU
As address generators, the IALUs perform immediate or indirect (pre- and post-modify) addressing. They perform modulus and bit-reverse operations with no constraints placed on memory addresses for the modulus data buffer placement. Each IALU can specify a single-, dual-, or quad-word access from memory.
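The modulus (circular buffer) and bit-reverse address calculations can be sketched in software as follows. The function names are hypothetical, and the sketch shows only the resulting address sequences; the hardware mechanism that produces them may differ.

```python
def circular_post_modify(addr, modifier, base, length):
    """Post-modify addressing within a circular buffer.

    Returns (address_used, next_address). The next address wraps within
    [base, base + length). The base needs no particular alignment.
    """
    next_addr = base + (addr - base + modifier) % length
    return addr, next_addr

def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index, as used for FFT reordering."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

# A 5-word buffer placed at an arbitrary, unaligned base address of 3.
used, nxt = circular_post_modify(addr=6, modifier=3, base=3, length=5)
order = [bit_reverse(i, 3) for i in range(8)]  # 8-point bit-reversed order
```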