AMD 64 manual hop-0 hop case for the write-only threads, Chapter, Analysis and Recommendations


40555 Rev. 3.00 June 2006

Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems

However, as shown in Figure 11 on page 31, when both threads are write-only, the 0 hop-1 hop and 0 hop-2 hop cases are faster than the 0 hop-0 hop case.

[Bar chart: "Total Time for both threads (write-write)". Vertical axis from 0 to 1.8. One bar per thread pair, in the order extracted: 0.0.w.0 / 0.1.w.0 (0 hops, 0 hops); 0.0.w.0 / 0.1.w.1 (0 hops, 1 hop); 0.0.w.0 / 0.1.w.2 (0 hops, 1 hop); 0.0.w.0 / 0.1.w.3 (0 hops, 2 hops). Bar labels in the order extracted: 147%, 126%, 125%, 136%.]

Figure 11. Both Write-Only Threads Running on Node 0 (Different Cores) on an Idle System

When a single thread reads locally, it generates a memory bandwidth load of 1.64 GB/s. Assuming a sustained memory bandwidth of 70% of the theoretical maximum of 6.4 GB/s (PC3200 DDR memory), the cumulative bandwidth demanded by two read-only threads does not exceed the sustained memory bandwidth on that node and hence the local or 0 hop-0 hop case is the fastest.

However, when a single thread writes locally it generates a memory bandwidth load of 2.98 GB/s. This is because each write in this test case results in a cache line eviction and thus generates twice the memory traffic generated by a read. The cumulative memory bandwidth demanded by 2 write-only threads now exceeds the sustained memory bandwidth on that node. The 0 hop-0 hop case now incurs the penalty of saturating the memory bandwidth on that node. For detailed analysis, refer to Section A.4 on page 42.
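The bandwidth comparison in the two paragraphs above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (6.4 GB/s peak, the 70% sustained assumption, 1.64 GB/s per reader, 2.98 GB/s per writer). The function and constant names are illustrative and do not come from the guide.

```python
# Figures quoted in the text, all in GB/s.
THEORETICAL_PEAK = 6.4     # PC3200 DDR theoretical maximum
SUSTAINED_FRACTION = 0.70  # assumed sustainable share of the peak
READ_LOAD = 1.64           # one thread reading local memory
WRITE_LOAD = 2.98          # one thread writing local memory


def sustained_bandwidth(peak=THEORETICAL_PEAK, fraction=SUSTAINED_FRACTION):
    """Sustained bandwidth of one node under the 70% assumption."""
    return peak * fraction


def saturates(per_thread_load, threads=2):
    """True if `threads` identical threads demand more than one node sustains."""
    return per_thread_load * threads > sustained_bandwidth()


if __name__ == "__main__":
    print(f"sustained per node : {sustained_bandwidth():.2f} GB/s")
    print(f"two readers demand : {2 * READ_LOAD:.2f} GB/s, "
          f"saturates: {saturates(READ_LOAD)}")
    print(f"two writers demand : {2 * WRITE_LOAD:.2f} GB/s, "
          f"saturates: {saturates(WRITE_LOAD)}")
```

Under these numbers the node sustains about 4.48 GB/s. Two readers demand 3.28 GB/s and stay below that limit, while two writers demand 5.96 GB/s and exceed it, which is the saturation penalty described for the 0 hop-0 hop write-write case.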

It is useful to study whether this observation is also applicable under a variable background load.

One would expect that, if the memory bandwidth demanded of the remote node were increased, at some point the 0 hop-1 hop case would become as slow as, and perhaps slower than, the 0 hop-0 hop case for the write-only threads.

The same two write-only threads as before are running on node 0, going through the following cases:

Both threads access local memory.

First thread accesses local memory and second thread accesses memory that is remote by one hop.

First thread accesses local memory and second thread accesses memory that is remote by two hops.
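The cases above line up with thread labels such as 0.1.w.3 in Figure 11. As a rough sketch, the snippet below decodes such a label and computes its hop count. It rests on two assumptions that this page does not state: that the label fields mean thread node, core, access type, and memory node, and that the four nodes are linked as a square (0-1, 0-2, 1-3, 2-3). That layout is inferred from the hop counts printed under the figure. All names are hypothetical.

```python
from collections import deque

# ASSUMPTION: four nodes linked as a square; inferred from the figure's
# hop counts (node 0 to 1 and 2 is one hop, node 0 to 3 is two hops).
LINKS = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}


def hops(src, dst, links=LINKS):
    """Fewest links between two nodes, found by breadth-first search."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in links[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError(f"node {dst} is unreachable from node {src}")


def parse_label(label):
    """Split a label like '0.1.w.3'.

    ASSUMED field order: thread node, core, access type, memory node.
    """
    node, core, access, mem = label.split(".")
    return int(node), int(core), access, int(mem)


def label_hops(label):
    """Hop count between a thread's node and the memory it accesses."""
    node, _core, _access, mem = parse_label(label)
    return hops(node, mem)


if __name__ == "__main__":
    first = "0.0.w.0"  # first thread: always local memory
    for second in ("0.1.w.0", "0.1.w.1", "0.1.w.2", "0.1.w.3"):
        print(f"{first} / {second}: "
              f"{label_hops(first)} hop - {label_hops(second)} hop")
```

With the assumed topology, the computed hop counts agree with the ones shown under the bars in Figure 11: 0, 1, 1, and 2 hops for the second thread.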

Chapter 3

Analysis and Recommendations
