misses a particular instruction incurred, but, instead, as an indication of which instructions incur the most data cache misses.

You can potentially get a rough estimate of the total number of data cache misses incurred by a particular instruction, for example, by doing the following:

1. Determine a scaling factor based on total misses and number of misses accounted for by sampling:

scale = total L1 misses / (total sampled misses * sampling rate)

2. Multiply the number of sampled misses associated with an instruction by the scaling factor:

total misses for instruction = scale * sampled misses for instruction
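The two steps can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of HP Caliper: the function and parameter names are hypothetical, and the sketch assumes the per-instruction figure is expressed in miss events, in the same units as total sampled misses multiplied by the sampling rate.

```python
def estimate_instruction_misses(total_l1_misses, total_sampled_misses,
                                sampling_rate, sampled_misses_for_instruction):
    """Return (scale, estimate) using the two steps described above.

    All names are hypothetical. sampled_misses_for_instruction is assumed
    to be in miss events, comparable to total_sampled_misses * sampling_rate.
    """
    # Misses accounted for by sampling.
    accounted = total_sampled_misses * sampling_rate
    if accounted <= 0:
        raise ValueError("no sampled misses to scale from")

    # Step 1: scaling factor.
    scale = total_l1_misses / accounted

    # Step 2: scaled estimate for one instruction.
    return scale, scale * sampled_misses_for_instruction
```

A scale below 1 means that the sampled events, multiplied by the sampling rate, exceed the exact L1 miss count; treat the estimate with extra caution in that case.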

However, depending on the density of floating-point load misses incurred by your application, such estimates could be very misleading.

Floating-point loads are serviced directly from the L2 cache. The PMU treats both L1 data cache misses and L2 floating-point load misses as data cache miss events for sampling purposes. Therefore, if your application makes frequent floating-point loads, then multiplying total samples by sampling rate might yield a data cache miss count that exceeds the total number of L1 data cache misses.

More frequent sampling increases HP Caliper's perturbation of your application. In the extreme case of taking one sample for each cache miss event, the kernel will trap on every event, making the resulting data of limited, if any, value.

How Latency Bucket Metrics Are Obtained

The PMU's data event address register (D-EAR) provides the number of cycles of latency for each sampled miss. HP Caliper places a data cache miss into one of the latency buckets based on the latency of the miss. HP Caliper uses its built-in table of expected latencies to determine whether a miss is serviced by the L2 cache, L3 cache, cell local memory, C2C, 1-hop memory, 2-hop memory, and so forth. HP Caliper uses different expected latencies depending on the CPU type, CPU frequency, and system model.
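The bucketing idea can be illustrated with a short Python sketch. The cycle thresholds and the reduced set of bucket names below are invented for illustration; the actual table is built into HP Caliper and varies by CPU type, CPU frequency, and system model.

```python
import bisect
from collections import Counter

# Hypothetical upper bounds in cycles, one per bucket except the last.
# These are NOT HP Caliper's values.
BUCKET_UPPER_BOUNDS = [10, 30, 250, 400]
BUCKET_NAMES = ["L2 cache", "L3 cache", "cell local memory",
                "1-hop memory", "beyond 1-hop"]

def bucket_for(latency_cycles):
    """Return the name of the first bucket whose upper bound covers the latency."""
    idx = bisect.bisect_left(BUCKET_UPPER_BOUNDS, latency_cycles)
    return BUCKET_NAMES[idx]

def bucket_counts(latencies):
    """Count sampled misses per latency bucket."""
    return Counter(bucket_for(cycles) for cycles in latencies)
```

For example, bucket_counts([7, 9, 25, 300]) places two samples in the first bucket and one each in the second and fourth, given the made-up thresholds above.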

How the Data Summary Information Is Obtained

The PMU's data event address register (D-EAR) provides the data address along with the number of cycles of latency for each sampled data cache miss. HP Caliper creates a histogram of samples by data address, aggregating all samples that fall on the same data address. After creating the histogram, HP Caliper maps the data addresses to global variables and aggregates all samples whose data addresses belong to the same global variable. If a data address does not belong to any global variable, it is assigned to a region in the process. HP Caliper creates a map of the different regions within a process and uses this map to assign sample data addresses to a process region.
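The aggregation described above might look like the following Python sketch. The symbol table, region map, and addresses are hypothetical stand-ins; how HP Caliper actually obtains and stores them is not described here.

```python
from collections import Counter

# Hypothetical tables of (name, start address, size in bytes).
GLOBALS = [("g_table", 0x1000, 0x100), ("g_count", 0x1100, 8)]
REGIONS = [("data", 0x1000, 0x1000), ("heap", 0x4000, 0x4000),
           ("stack", 0x9000, 0x1000)]

def lookup(addr, ranges):
    """Return the name of the first range containing addr, or None."""
    for name, start, size in ranges:
        if start <= addr < start + size:
            return name
    return None

def summarize(sample_addresses):
    """Aggregate samples by address, then by global variable or process region."""
    by_address = Counter(sample_addresses)  # histogram of samples per address
    by_global, by_region = Counter(), Counter()
    for addr, count in by_address.items():
        name = lookup(addr, GLOBALS)
        if name is not None:
            by_global[name] += count
        else:
            # No global variable owns this address: fall back to a region.
            by_region[lookup(addr, REGIONS) or "unknown"] += count
    return by_global, by_region
```

Samples are attributed to a global variable first, and only the remainder is attributed to a process region, which matches the order described in the text.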

dtlb Measurement Report Description

With the dtlb measurement, produced by the dtlb measurement configuration file, HP Caliper measures and reports two levels of information:

Exact counts of data translation lookaside buffer (TLB) metrics summed across the entire run of an application.

Sampled data TLB metrics that are associated with particular locations in the measured application. Data TLB misses can hit the L2 TLB, can be handled by the hardware page walker (HPW), or can be handled by software.

The report shows measured data by thread, load module, function, statement, and instruction.

Command-line options allow you to control the amount of data reported, how the data is sorted, and the number of statements and instructions reported for each sampled program location.

Example Command Line for Text Report

$ caliper dtlb -o reports/dtlbm.txt ./wordplay thequickbrownfox
