40555 Rev. 3.00 June 2006 | Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems |
However, as shown in Figure 11 on page 31, when both threads are
[Bar chart: Total Time for both threads, axis scale 0 to 1.8. Bar labels, left to right:]
| Thread configuration | Hops (first thread, second thread) | Bar label |
| 0.0.w.0 0.1.w.0 | 0 hops, 0 hops | 147% |
| 0.0.w.0 0.1.w.1 | 0 hops, 1 hop | 126% |
| 0.0.w.0 0.1.w.2 | 0 hops, 1 hop | 125% |
| 0.0.w.0 0.1.w.3 | 0 hops, 2 hops | 136% |
Figure 11. Both
When a single thread reads locally, it generates a memory bandwidth load of 1.64 GB/s. Assuming a sustained memory bandwidth of 70% of the theoretical maximum of 6.4 GB/s (PC3200 DDR memory), the cumulative bandwidth demanded by two
However, when a single thread writes locally it generates a memory bandwidth load of 2.98 GB/s. This is because each write in this test case results in a cache line eviction and thus generates twice the memory traffic generated by a read. The cumulative memory bandwidth demanded by 2
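The bandwidth figures quoted in the two paragraphs above can be checked with a short calculation. The following is a minimal sketch, not part of the original measurements: the constant and function names are hypothetical, and it assumes that both threads direct their traffic at the memory of a single node, so that their loads simply add.

```python
# Figures quoted in the text, in GB/s.
THEORETICAL_PEAK = 6.4      # PC3200 DDR memory, theoretical maximum
SUSTAINED_FRACTION = 0.70   # sustained share of the peak assumed in the text
READ_LOAD = 1.64            # load generated by one thread reading locally
WRITE_LOAD = 2.98           # load generated by one thread writing locally


def cumulative_demand(per_thread_load, threads=2):
    """Total bandwidth demanded when all threads target the same node.

    Assumption (not from the source): per-thread loads add linearly.
    """
    return per_thread_load * threads


sustained = THEORETICAL_PEAK * SUSTAINED_FRACTION
two_readers = cumulative_demand(READ_LOAD)
two_writers = cumulative_demand(WRITE_LOAD)

for label, demand in (("2 readers", two_readers), ("2 writers", two_writers)):
    verdict = "within" if demand <= sustained else "exceeds"
    print(f"{label}: {demand:.2f} GB/s {verdict} sustained {sustained:.2f} GB/s")
```

Under these assumptions, two local readers demand 3.28 GB/s against roughly 4.48 GB/s of sustained bandwidth, while two local writers demand 5.96 GB/s, which is more than a single node can sustain.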
It is useful to study whether this observation is also applicable under a variable background load.
One would expect that, if the memory bandwidth demanded of the remote node were increased, at some point the 0
0 hop-0 hop case for the write-only threads.
The same two
• Both threads access local memory.
• First thread accesses local memory and second thread accesses memory that is remote by one hop.
• First thread accesses local memory and second thread accesses memory that is remote by two hops.
Chapter 3 | Analysis and Recommendations | 31 |