The average memory read latency on the dual-core Itanium 2 processor will appear greater than on previous Itanium 2 processors. This is because the reported latency also includes the latency that the arbiter adds to both the outbound request and inbound data transfer.

Avg Outstand

Average number of outstanding reads per cycle gives some idea of the memory request density, that is, the probability of one or more memory requests per cycle. For control-dominated code or for workloads that seldom miss the internal caches, this value will be very small. For data-flow-type workloads, this number can, if extensive prefetching is employed, be quite high, up to a maximum of 16, which is the Itanium 2 bus limit.

The reported average latency value will be incorrect on Itanium 2 steppings earlier than B2.

CPU

CPU transaction component is a measure of the percentage of all bus transactions generated by all CPUs on a shared front side bus (FSB).

I/O

I/O transaction component is a measure of the percentage of all bus transactions initiated by any I/O agent on a shared FSB.

Util Adrs

Average address bus utilization gives an estimate of total address bus utilization resulting from all bus transactions, including cache misses, I/O port reads/writes, interprocessor interrupts, writebacks, cache line invalidates (FC instruction, store hit on shared line), and clean castouts (if enabled). The utilization is computed as follows:

ADRS UTIL = 100.0 * (total transactions/sec * 3.0) / bus cycles/sec

The constant value (3.0) is the number of address cycles needed for each bus transaction.
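
As an illustration, the address bus formula above can be sketched in Python. The function name, the parameter names, and the sample rates (10 million transactions per second on a 200 MHz bus) are assumptions chosen for the example; they are not Caliper output or part of any Caliper interface.

```python
# Number of address cycles needed for each bus transaction (from the formula above).
ADDRESS_CYCLES_PER_TRANSACTION = 3.0


def address_bus_utilization(transactions_per_sec, bus_cycles_per_sec):
    """Return ADRS UTIL as a percentage.

    Both arguments are rates per second. The names are hypothetical and
    exist only for this sketch.
    """
    if bus_cycles_per_sec <= 0:
        raise ValueError("bus_cycles_per_sec must be positive")
    address_cycles_per_sec = transactions_per_sec * ADDRESS_CYCLES_PER_TRANSACTION
    return 100.0 * address_cycles_per_sec / bus_cycles_per_sec


# Hypothetical sample: 10 million transactions/sec on a 200 MHz bus.
print(address_bus_utilization(10_000_000, 200_000_000))  # 15.0
```

With these sample figures, 30 million of the 200 million available bus cycles per second carry address traffic, which gives 15 percent.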

Util Data

Data bus utilization gives a lower bound estimate of total data bus utilization resulting from bus transactions that result in a data transfer, that is, BRL, BRIL, BWL, and nonzero byte BRP/BWP transactions. A lower bound data bus utilization is computed as follows:

DATA BUS CYCLES/SEC = ((BRL + BRIL + BWL + IMPLICIT WB)/sec * 4.0) + ((nonzero byte BRP's/BWP's)/sec * 1.0)

DATA UTIL = 100 * (DATA BUS CYCLES/SEC) / BUS CYCLES/SEC

The constants (4.0 and 1.0) represent the number of cycles that the data bus is occupied to perform the requisite data transfer. All cache line transfers (BRL, BRIL, BWL) require four cycles. The nonzero byte BRP's/BWP's require one or two cycles (16, 32, 64 bytes). Because most of the nonzero byte BRP's/BWP's are to I/O ports and semaphores, the computation assumes a single-cycle transfer for them. Any two-cycle partial transfer is therefore counted as one cycle, so the result can slightly undercount data bus cycles.
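
The lower-bound data bus computation can be sketched in the same way. All names and sample rates below are assumptions made for illustration. The `partial_cycles` parameter is a hypothetical addition: the default of 1.0 matches the formula above, and passing 2.0 gives a rough upper bound for the partial transfers.

```python
# Data bus cycles occupied by one full cache line transfer (BRL, BRIL, BWL, implicit WB).
LINE_TRANSFER_CYCLES = 4.0


def data_bus_utilization(brl, bril, bwl, implicit_wb, nonzero_partials,
                         bus_cycles_per_sec, partial_cycles=1.0):
    """Return DATA UTIL as a percentage.

    The first five arguments are transaction rates per second.
    nonzero_partials is the combined rate of nonzero byte BRP/BWP
    transactions. All names are hypothetical and exist only for this sketch.
    """
    if bus_cycles_per_sec <= 0:
        raise ValueError("bus_cycles_per_sec must be positive")
    line_cycles = (brl + bril + bwl + implicit_wb) * LINE_TRANSFER_CYCLES
    partial = nonzero_partials * partial_cycles
    return 100.0 * (line_cycles + partial) / bus_cycles_per_sec


# Hypothetical sample rates (per second) on a 200 MHz bus.
sample = dict(brl=5_000_000, bril=1_000_000, bwl=1_500_000,
              implicit_wb=500_000, nonzero_partials=2_000_000,
              bus_cycles_per_sec=200_000_000)

print(data_bus_utilization(**sample))                      # 17.0 (lower bound)
print(data_bus_utilization(**sample, partial_cycles=2.0))  # 18.0 (two-cycle partials)
```

In this sample the gap between the two results is one percentage point, which shows how small the undercount stays when partial transfers are a minor share of the traffic.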

BRL

Bus Read Line is the transaction used to read cache lines, due either to an instruction cache miss or to a load data miss.

BRIL

Bus Read Invalidate Line is the transaction used when a store miss occurs, that is, a read for ownership. In Itanium 2, this transaction is also used when a store hit occurs on a shared line. In this case, the BRIL invalidates all remote copies of the cache line, and the memory controller returns the line, which the requesting cache already holds. Itanium 2 does not implement the BIL optimization, which would have allowed remote copies to be invalidated without performing a superfluous memory request.

