40555 Rev. 3.00 June 2006

Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™

 

ccNUMA Multiprocessor Systems

A ccNUMA-aware OS keeps data local on the node where the first touch occurs, as long as enough physical memory is available on that node. If it is not, the OS falls back to other placement heuristics, which vary from one OS to another.

Once data has been placed on a node by first touch, it normally resides on that node for its lifetime. However, the OS scheduler can migrate the thread that first touched the data from one core to another, even to a core on a different node, for the purpose of load balancing [3].

This migration has the effect of moving the thread farther from its data. Some schedulers try to bring the thread back to a core on a node where the data is in local memory, but this is never guaranteed. Furthermore, the thread could first touch more data on the node to which it was moved before it is moved back. This is a difficult problem for the OS to resolve, since it has no prior information as to how long the thread will run and, hence, whether migrating it back is desirable or not.

If profiling shows that the scheduler is moving an application's threads away from their associated memory, it is typically useful to set thread placement explicitly. By pinning a thread to a node, the application tells the OS to keep the thread on that node and thus, by virtue of first touch, keeps the data the thread accesses local to it.

The performance improvement obtained by explicit thread placement may vary depending on whether the application is multithreaded, whether it needs more memory than available on a node, whether threads are being moved away from their data, etc.

In some cases, where threads are scheduled from the outset on a core that is remote from their data, it might be useful to control the data placement explicitly. This is discussed in detail in Section 3.2.2.

The tools and APIs previously discussed for controlling thread placement explicitly can also be used to control data placement explicitly. For additional details on thread and memory placement tools and APIs in various operating systems, refer to Section A.7 on page 44.

3.2.2 Data Placement Techniques to Alleviate Unnecessary Data Sharing Between Nodes Due to First Touch

When data is shared between threads running on different nodes, the OS's default policy of local allocation by first touch can become suboptimal.

For example, a multithreaded application may have a startup thread that sets up the environment, allocates and initializes a data structure, and forks off worker threads. Under the default local allocation policy, the data structure is placed in the physical memory of the node where the startup thread performed the first touch. The scheduler then spreads the forked worker threads across all nodes and their cores for load balancing. Each worker thread accesses the data structure remotely from the node where the first touch occurred. This can generate significant memory and HyperTransport traffic and makes the node where the data resides the bottleneck. This situation is especially bad for performance if the startup thread only does the initialization and

Chapter 3

Analysis and Recommendations

23
