data in the same cache block as the lock.
For the
Therefore, accesses to shared data, including locks, should be minimized. For example, a
Array manipulation should be partitioned across processors so that cache blocks do not thrash between processors. Having each of four processors work on every fourth array element severely impairs performance on any implementation with a cache block of four elements or larger. The processors all contend for copies of the same cache blocks and use only one quarter of the data in each block. Writes in one processor severely impair cache performance on all processors.
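The contention described above can be made concrete with a small model. The following is a minimal sketch, not taken from the original text: the function names, the array size of 64 elements, and the block size of four elements are illustrative assumptions. It counts how many cache blocks are touched by more than one processor when elements are dealt out every fourth element, and compares that with giving each processor one contiguous chunk.

```python
def blocks_touched(indices, block_elems):
    """Return the set of cache-block numbers covering the given element indices."""
    return {i // block_elems for i in indices}


def interleaved(n, nproc, p):
    """Processor p works on elements p, p + nproc, p + 2*nproc, ..."""
    return range(p, n, nproc)


def contiguous(n, nproc, p):
    """Processor p works on one contiguous chunk of about n/nproc elements."""
    chunk = (n + nproc - 1) // nproc  # ceiling division
    return range(p * chunk, min((p + 1) * chunk, n))


def shared_blocks(partition, n, nproc, block_elems):
    """Count cache blocks that are touched by more than one processor."""
    owners = {}
    for p in range(nproc):
        for b in blocks_touched(partition(n, nproc, p), block_elems):
            owners.setdefault(b, set()).add(p)
    return sum(1 for procs in owners.values() if len(procs) > 1)


# Illustrative sizes (assumptions, not from the text):
# 64 elements, 4 processors, 4 elements per cache block.
N, NPROC, BLOCK = 64, 4, 4
print("interleaved:", shared_blocks(interleaved, N, NPROC, BLOCK))
print("contiguous: ", shared_blocks(contiguous, N, NPROC, BLOCK))
```

Under these assumed sizes, every one of the 16 blocks is shared by all four processors in the interleaved case, while the contiguous partition shares none. The model only counts block ownership; it does not simulate coherence traffic or timing.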
A better decomposition is to give each processor the largest possible contiguous chunk of data to work on (N/4 consecutive rows for four processors and
Similarly,
Implementors should give first priority to an efficient
Implementors should assume that the amount of shared data will continue to increase, so over time the need for efficient sharing implementations will also increase.
A.3.3 Avoiding Cache/TB Conflicts — Factor of 1
Occasionally, programs that run with a
Note:
No Alpha processor through and including the 21264 has implemented a