cache lines (since any particular line from memory may only be loaded into a
cache line in the same congruence class). This effect happens with strides
that are a multiple of a power of two.
The POWER3 cache, with its much greater degree of set associativity, is
much less susceptible to this problem than the POWER2 cache. Strides that
are multiples of 1024 bytes place all the data in the same congruence
class, but reduce the apparent cache size only by a factor of 4. Odd
multiples of 512 bytes will halve this effective size. This is, however,
minor compared with the possible reduction by a factor of 128 on POWER2.
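
The link between stride and congruence classes can be sketched with a small model. The geometry below (64-byte lines, 8 congruence classes, a simple modulo index) is a deliberately hypothetical toy chosen for illustration; it is not the POWER3 or POWER2 cache layout, and the function names are assumptions. The sketch only shows the general mechanism: a stride that keeps landing in the same classes leaves the other classes unused.

```python
def sets_touched(stride_bytes, line_bytes=64, num_sets=8, accesses=1024):
    """Count the distinct congruence classes hit by a strided walk from address 0.

    Assumes a toy cache where class = (address // line_bytes) % num_sets.
    """
    return len({(i * stride_bytes // line_bytes) % num_sets
                for i in range(accesses)})


def capacity_reduction(stride_bytes, line_bytes=64, num_sets=8):
    """Factor by which the usable cache shrinks for this stride in the toy model."""
    return num_sets / sets_touched(stride_bytes, line_bytes, num_sets)


if __name__ == "__main__":
    # Strides that are multiples of a power of two collapse onto few classes;
    # an odd number of lines (192 bytes = 3 lines) still reaches every class.
    for stride in (64, 192, 256, 512, 1024):
        print(stride, sets_touched(stride), capacity_reduction(stride))
```

In this toy model a stride of 192 bytes still uses every class, while a stride of 512 bytes uses only one, so the apparent cache shrinks by the full number of classes.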
2.5.1.3 Effects of L1 Cache Misses on Performance
The effects of cache misses on performance for POWER3 (the Model 260)
are illustrated in Figure 15, which follows a load instruction into a
floating-point register (assuming the data is already resident in memory).
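
As a rough companion to Figure 15, the latencies it quotes (1 cycle from L1, 6-7 cycles from L2, about 36 cycles from memory) can be combined into an average cost per load. This is a minimal sketch: the hit rates are hypothetical inputs, the L2 figure is taken as the midpoint of the quoted range, and the function name is an assumption made for illustration.

```python
L1_CYCLES = 1       # load satisfied from L1 cache
L2_CYCLES = 6.5     # assumed midpoint of the quoted 6-7 cycle range
MEM_CYCLES = 36     # load satisfied from memory only


def average_load_cycles(l1_hit_rate, l2_hit_rate):
    """Average cycles per load.

    l1_hit_rate: fraction of loads found in L1.
    l2_hit_rate: fraction of L1 misses that are found in L2.
    Each latency is the total cost of that path, so they are not added together.
    """
    for rate in (l1_hit_rate, l2_hit_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("hit rates must lie between 0 and 1")
    l1_miss_rate = 1.0 - l1_hit_rate
    miss_cost = l2_hit_rate * L2_CYCLES + (1.0 - l2_hit_rate) * MEM_CYCLES
    return l1_hit_rate * L1_CYCLES + l1_miss_rate * miss_cost


if __name__ == "__main__":
    # Hypothetical workload: 90% of loads hit L1, half of the rest hit L2.
    print(average_load_cycles(0.9, 0.5))
```

Even with a high L1 hit rate, the few loads that go to memory dominate the average, which is why the stride effects described earlier matter.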
Is the data in L1 cache?
  Yes: Load in floating-point register (takes 1 cycle).
  No:  Is the data in L2 cache?
    Yes: Load in floating-point register (takes 6-7 cycles).
    No:  Data in memory only. Load in floating-point register (takes 36 cycles).

Figure 15. Loading Instructions from Memory to a Floating-Point Register

If the data is in memory only, it takes about 36 cycles; if it is in L1 cache, it
takes one cycle. That is, the cost of a cache miss to memory is 35 cycles. The
same timing applies to storing data from registers into memory. If the store is