Intel® IXP42X product line and IXC1100 control plane processors—Intel XScale® Processor
Intel® IXP42X Product Line of Network Processors and IXC1100 Control Plane Processor
DM September 2006
188 Order Number: 252480-006US
3.10.4.4.6 Cache Blocking
Cache blocking techniques, such as strip-mining, are used to improve temporal locality
of the data. Given a large data set that can be reused across multiple passes of a loop,
data blocking divides the data into smaller chunks which can be loaded into the cache
during the first loop and then be available for processing on subsequence loops thus
minimizing cache misses and reducing bus traffic.
As an example of cache blocking consider the following code:
The variable A[i][k] is completely reused. However, accessing C[j][k] in the j and k
loops can displace A[i][j] from the cache. Using blocking the code becomes:
3.10.4.4.7 Prefetch Unrolling
When iterating through a loop, data transfer latency can be hidden by prefetching
ahead one or more iterations. The solution incurs an unwanted side affect that the final
interactions of a loop loads useless data into the cache, polluting the cache, increasing
bus traffic and possibly evicting valuable temporal data. This problem can be resolved
by prefetch unrolling. For example consider:
Interactions i-1 and i, will prefetch superfluous data. The problem can be avoid by
unrolling the end of the loop.
struct employee {
struct employee *prev;
struct employee *next;
int ssno;
int empid;
float Year2DatePay;
float Year2DateTax;
float Year2Date401KDed;
float Year2DateOtherDed;
};
for(i=0; i<10000; i++)
for(j=0; j<10000; j++)
for(k=0; k<10000; k++)
C[j][k] += A[i][k] * B[j][i];
for(i=0; i<10000; i++)
for(j1=0; j<100; j++)
for(k1=0; k<100; k++)
for(j2=0; j<100; j++)
for(k2=0; k<100; k++)
{j = j1 * 100 + j2;
k = k1 * 100 + k2;
C[j][k] += A[i][k] * B[j][i];
}
for(i=0; i<NMAX; i++)
{prefetch(data[i+2]);
sum += data[i];
}