16 | The UltraSPARC T2 Processor with CoolThreads Technology | Sun Microsystems, Inc. |
An eight-stage integer pipeline and a 12-stage floating-point pipeline are provided by each UltraSPARC processor core (Figure 7). A new “pick” pipeline stage has been added to choose two threads (out of the eight possible per core) to execute each cycle.
Fetch Cache Pick Decode Execute Mem Bypass W
Eight-Stage Integer Pipeline
Fetch | Cache | Pick | Decode | Execute | Fx1 | Fx2 | Fx3 | Fx4 | Fx5 | Fx6 | FW |
| | | | | | | | | | | |
Twelve-Stage Floating-Point Pipeline
Figure 7. UltraSPARC T2 per-core integer and floating-point pipelines
To illustrate how the dual pipelines function, Figure 8 depicts the integer pipeline with the load store unit (LSU). The instruction cache is shared by all eight threads within the core. A least-recently-fetched algorithm is used to select the next thread to fetch. Each thread is written into a thread-specific instruction buffer (IB) and each of the eight threads is statically assigned to one of two thread groups within the core.
IB0-3
P0
D2
E0
M3
B1
W2
Thread Group 0
IB4-7
P5
D7
E6
M4
B7
W6
Thread Group 1
Figure 8. Threads are interleaved between pipeline stages with very few restrictions (integer pipeline shown, letters depict pipeline stages, numbers depict different scheduled threads)
The “pick” stage chooses one thread each cycle within each thread group. Picking within each thread group is independent of the other, and a least-recently-picked algorithm is used to select the next thread to execute. The decode state resolves resource conflicts that are not handled during the pick stage. As shown in the illustration, threads are interleaved between pipeline stages with very few restrictions. Any thread can be at the fetch or cache stage, before being split into either of the two thread groups. Load/store and floating point units are shared between all eight threads. Only one thread from either thread group can be scheduled on such a shared unit.