estate to build increasingly complex processors, with instruction-level parallelism (ILP) as a goal. Today these traditional processors employ very high frequencies along with a variety of sophisticated tactics to accelerate a single instruction pipeline, including:
• Large caches
• Superscalar designs
• Out-of-order execution
• Very high clock rates
• Deep pipelines
• Speculative pre-fetches
While these techniques have produced faster processors with impressive-sounding multiple-gigahertz frequencies, they have largely resulted in complex, hot, and power-hungry processors that are not well suited to the types of workloads often found in modern datacenters. In fact, many datacenter workloads are simply unable to take advantage of the hard-won ILP provided by these processors. Applications that make heavy use of shared memory and serve large numbers of simultaneous users or transactions are typically more focused on processing many simultaneous threads (thread-level parallelism, TLP) than on running a single thread as quickly as possible (ILP).
Making matters worse, most of the ILP in existing applications has already been extracted, and further gains promise to be small. In addition, microprocessor frequency scaling has itself leveled off in the 2-3 GHz range because of power issues: with higher clock speeds, each successive processor generation has seemingly demanded more power than the last. Deploying pipelined superscalar processors requires more power, so this approach is ultimately limited by the fundamental ability to cool the processors.
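The link between clock rate and power can be illustrated with the standard first-order model of dynamic (switching) power in CMOS logic. This model is not taken from this paper and is offered only as a rough approximation. The symbols are introduced here for illustration: α is the activity factor, C the switched capacitance, V the supply voltage, and f the clock frequency.

```latex
P_{\text{dynamic}} \approx \alpha \, C \, V^{2} f
```

Because sustaining a higher f generally also calls for a higher V, power tends to grow faster than linearly with clock rate under this model, which is consistent with the cooling limit described above.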
Chip Multiprocessing with Multicore Processors
To address these issues, many in the microprocessor industry have used the transistor budget provided by Moore's Law to group two or even four conventional processor cores on a single physical die, creating multicore processors (or chip multiprocessors, CMP). The individual processor cores in many CMP designs offer no greater performance than previous single-processor chips and, in fact, have been observed to run single-threaded applications more slowly than the single-core versions of the same processors. However, aggregate chip performance increases because multiple programs (or multiple threads) can be accommodated in parallel (thread-level parallelism).
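The trade-off between single-thread speed and aggregate throughput can be sketched with a small idealized model. All names and numbers below are hypothetical and chosen only for illustration. The model assumes fully independent threads and ignores memory contention, so it gives an upper bound rather than a measured result.

```python
def aggregate_throughput(cores: int, per_core_speed: float) -> float:
    """Relative work completed per unit time when every core runs its own
    independent thread (idealized: no memory or cache contention)."""
    return cores * per_core_speed


def single_thread_time(work: float, per_core_speed: float) -> float:
    """Relative time for one thread to finish `work` units on one core."""
    return work / per_core_speed


# Hypothetical relative speeds, not measurements of any real processor.
SINGLE_CORE_SPEED = 1.0   # baseline single-core chip
CMP_CORE_SPEED = 0.8      # each core of an assumed 4-core CMP is somewhat slower
CMP_CORES = 4

baseline_throughput = aggregate_throughput(1, SINGLE_CORE_SPEED)
cmp_throughput = aggregate_throughput(CMP_CORES, CMP_CORE_SPEED)

baseline_latency = single_thread_time(1.0, SINGLE_CORE_SPEED)
cmp_latency = single_thread_time(1.0, CMP_CORE_SPEED)

print(f"aggregate throughput: baseline {baseline_throughput:.2f}, CMP {cmp_throughput:.2f}")
print(f"single-thread time:   baseline {baseline_latency:.2f}, CMP {cmp_latency:.2f}")
```

Under these assumed numbers, a single thread takes longer on the CMP, while the chip as a whole completes more work per unit time as long as there are at least as many runnable threads as cores.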
Unfortunately, most chip multiprocessors that are currently available (or soon to be available) simply replicate cores from existing (single-threaded) processor designs. This approach typically yields only slight improvements in aggregate performance because it ignores key performance issues such as memory speed and hardware thread context switching. As