VLC Architecture, Crossbar-less configuration

VLC Architecture

High-speed / low latency Intra-Cell cache-to-cache data transfer

The Express5800/1000 series server implements the VLC architecture, which allows for low latency cache-to-cache data transfer between multiple CPUs within a cell.

In a split BUS architecture, a cache-to-cache data transfer must pass through the chipset. In the VLC architecture, the CPUs can access one another's cache memory directly, bypassing the chipset. This lowers the latency between cache memories, which results in faster data transfers.
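The difference between the two transfer paths can be sketched as a toy model. This is illustrative only: the component names ("FSB A", "FSB B", and so on) and the idea of counting each link as one hop are assumptions made for the example, not measured characteristics of the Express5800/1000 series.

```python
# Toy model of a cache-to-cache transfer path (illustrative assumptions only).
# Each list names the components a cache line passes through, in order.

# Split BUS: the two CPUs sit on different front-side buses, so the
# transfer is relayed by the chipset.
SPLIT_BUS_PATH = ["source CPU cache", "FSB A", "chipset", "FSB B", "requesting CPU cache"]

# VLC: the CPUs reach each other's cache directly, bypassing the chipset.
VLC_PATH = ["source CPU cache", "FSB", "requesting CPU cache"]


def hop_count(path):
    """Number of links traversed between consecutive components."""
    return len(path) - 1


def passes_through_chipset(path):
    """True when the transfer is relayed by the chipset."""
    return "chipset" in path


if __name__ == "__main__":
    for name, path in (("split BUS", SPLIT_BUS_PATH), ("VLC", VLC_PATH)):
        print(f"{name}: {hop_count(path)} hops, via chipset: {passes_through_chipset(path)}")
```

The hop counts say nothing about real latency; they only make visible that the split BUS path contains an extra relay that the VLC path does not.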

[Figure: Very Large Cache (VLC) Architecture compared with Split BUS Architecture]

Very Large Cache (VLC) Architecture: increased enterprise application performance through reduced cache memory access latency. CPUs on the FSB perform direct CPU-to-CPU, high-speed cache-to-cache transfers.

Split BUS Architecture: higher cache memory access latency, non-uniform cache-to-cache data transfer, and inconsistent performance. Transferring data through the chipset (data transfer controller) adds overhead.

Each panel plots latency against data size for the Intel® Itanium® 2 processor (Madison: L3 9MB) and the Dual-Core Intel® Itanium® processor (Montvale: L3 24MB). In the VLC panels, the L3 of another CPU forms a single region. In the split BUS panels, the L3 of another CPU on the same FSB and the L3 of another CPU on a different FSB form separate regions; the latter is marked with a latency degradation of approximately 3x, and this region grows with the increase in cache size.

This image does not depict actual numbers.

Dedicated Cache Coherency Interface (CCI)

High-speed / low latency Inter-Cell cache-to-cache data transfer

Another technology implemented in the Express5800/1000 series server to improve cache-to-cache data transfer is the Cache Coherency Interface (CCI). CCI, the inter-Cell counterpart of the VLC architecture, allows for a lower latency cache-to-cache data transfer between Cells.

Information containing the location and state of cached data is required for the CPU to access the specific data stored in cache memory. By accessing the cache memory according to this information, the CPU is able to retrieve the desired data.

Two main mechanisms exist for cache-to-cache data transfer between Cells: directory based and TAG based cache coherency. The cache information described above is stored in external memory (DIR memory) in the directory based mechanism, and within the chipset in the TAG based mechanism.

The Express5800/1000 series server implements the TAG based mechanism. Its benefit is that, by consulting the TAG, unnecessary inquiries to the cache memory are filtered out, giving a smoother transfer of data. Furthermore, the Express5800/1000 series server includes a dedicated high-speed cache coherency interface (CCI), which connects the Cells directly to one another without using a crossbar. This interface is used for broadcasting and other cache coherency transactions to allow for even faster cache-to-cache data transfer.
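The filtering role of the TAG can be sketched in a few lines. This is a minimal sketch under stated assumptions: the `TagFilter` class, its method names, and the dictionary layout are hypothetical and chosen for illustration; they do not describe how the chipset actually organises its TAG memory.

```python
# Minimal sketch of a TAG acting as a filter (hypothetical structure).
# The TAG maps a cache-line address to the set of CPUs that may hold it.

class TagFilter:
    def __init__(self):
        self._holders = {}  # address -> set of CPU ids

    def record_fill(self, cpu, address):
        """Note that `cpu` has loaded the line at `address` into its cache."""
        self._holders.setdefault(address, set()).add(cpu)

    def record_evict(self, cpu, address):
        """Note that `cpu` no longer holds the line at `address`."""
        holders = self._holders.get(address)
        if holders is not None:
            holders.discard(cpu)
            if not holders:
                del self._holders[address]

    def caches_to_query(self, requester, address):
        """CPUs whose caches must be queried; all others are filtered out."""
        return sorted(self._holders.get(address, set()) - {requester})


if __name__ == "__main__":
    tag = TagFilter()
    tag.record_fill(cpu=2, address=0x1000)
    print(tag.caches_to_query(requester=0, address=0x1000))  # only CPU 2 is queried
    print(tag.caches_to_query(requester=0, address=0x2000))  # no cache is queried
```

When no CPU holds the requested line, the list is empty and no inquiry reaches any cache, which is the filtering effect described above.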

[Figure: Tag Based Cache Coherency compared with Directory Based Cache Coherency]

Tag Based Cache Coherency: the request is broadcast to all CPUs simultaneously through the TAG held in each chipset.

A3 Chipset: performance increase with the A3 chipset.

In a directory based system, the requesting CPU first accesses the external memory to confirm the location of the cached data, and then accesses the appropriate cache memory. In a TAG based system, by contrast, the requesting CPU broadcasts a request to all other caches simultaneously via the TAG.
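The two request sequences described above can be written out as a short sketch. The function names, the `owner_of` lookup table, and the Cell numbering are hypothetical and exist only to make the contrast concrete; the real protocol involves more states and messages than shown here.

```python
# Sketch of the request sequence under each mechanism (illustrative only).
# `owner_of` is a hypothetical table: cache-line address -> Cell holding the line.

def directory_based_steps(address, owner_of):
    """Two dependent steps: consult DIR memory, then access the owning cache."""
    owner = owner_of[address]
    return [
        f"read DIR memory for {address:#x}",
        f"access cache on Cell {owner}",
    ]


def tag_based_steps(address, cells, requester):
    """One broadcast step: every other Cell checks its TAG at the same time."""
    targets = [c for c in cells if c != requester]
    return [f"broadcast {address:#x} to TAG of Cells {targets}"]


if __name__ == "__main__":
    for step in directory_based_steps(0x40, {0x40: 3}):
        print("directory:", step)
    for step in tag_based_steps(0x40, cells=[0, 1, 2, 3], requester=0):
        print("TAG:", step)
```

The directory sequence is serial because the second step depends on the result of the first, while the TAG sequence is a single step that fans out to all other Cells.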

The Express5800/1000 series server implements a dedicated connection (CCI) for snooping.

[Figure: Directory Based Cache Coherency]

Access the directory to confirm the location of the data first, then access the appropriate cache memory.

Legend:

CPU: CPU requesting the information
CPU: CPU storing the newest information
Memory: memory that stores the location information
TAG: TAG memory (manages cache line information for all of the CPUs loaded on a Cell card)
DIR: DIR memory (manages cache line information for all of the memory loaded on a Cell card)

Crossbar-less configuration

Improved data transfer latency through direct attached Cell configuration

Within the Express5800/1000 series server lineup, the 1080Rf lowers data transfer latency by removing the crossbar and directly connecting Cell to Cell and Cell to PCI box.

Even in the crossbar-less configuration, virtualization of the Cell card and I/O box is retained so as not to diminish computing and I/O resources.
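The effect of removing the crossbar can be illustrated by counting links in two small topologies. The node names, the two-Cell layout, and the single PCI box are assumptions for the example, not a description of the actual 1080Rf wiring.

```python
from collections import deque

# Hypothetical topologies: two Cells and one PCI box (illustrative only).
WITH_CROSSBAR = {
    "cell0": ["xbar"],
    "cell1": ["xbar"],
    "pci": ["xbar"],
    "xbar": ["cell0", "cell1", "pci"],
}
DIRECT_ATTACHED = {
    "cell0": ["cell1", "pci"],
    "cell1": ["cell0", "pci"],
    "pci": ["cell0", "cell1"],
}


def hops(links, src, dst):
    """Breadth-first search: fewest links between two nodes, or None if unreachable."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in links.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


if __name__ == "__main__":
    print("with crossbar:", hops(WITH_CROSSBAR, "cell0", "cell1"), "links")
    print("direct attached:", hops(DIRECT_ATTACHED, "cell0", "cell1"), "links")
```

In this model every Cell-to-Cell or Cell-to-PCI-box path loses the intermediate crossbar link once the components are attached directly.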


Models: 5800 Series 1000 Series 1320Xf/1160Xf 1080Rf
