Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ | 40555 Rev. 3.00 June 2006 |
ccNUMA Multiprocessor Systems |
afterwards no longer needs the data structure and if only one of the worker threads needs the data structure. In other words, the data structure is not truly shared between the worker threads.
It is best in this case to use a data initialization scheme that avoids incorrect data placement due to first touch. This is done by allowing each worker thread to first touch its own data or by explicitly pinning the data associated with each worker thread on the node where the worker thread runs.
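The first option above, letting each worker thread first touch its own data, can be sketched as follows. This is a minimal illustration and not code from this guide: the names worker_ctx, worker_main, and run_workers are hypothetical, POSIX threads are assumed, and the sketch assumes that each worker is already running on its intended node and that the allocator returns pages that have not been touched before (typical for large allocations).

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-worker state: each worker owns its buffer exclusively. */
typedef struct {
    unsigned char *data;     /* allocated and first touched by the worker */
    size_t size;             /* buffer size in bytes */
    unsigned long checksum;  /* result of the worker's computation */
} worker_ctx;

static void *worker_main(void *arg)
{
    worker_ctx *ctx = arg;

    /* Allocate inside the worker, not in the startup thread. */
    ctx->data = malloc(ctx->size);
    if (ctx->data == NULL)
        return NULL;

    /* First touch: under a local allocation policy, the pages are
       placed on the node where this thread is running. */
    memset(ctx->data, 1, ctx->size);

    /* Stand-in for the real work, which now reads local memory. */
    unsigned long sum = 0;
    for (size_t i = 0; i < ctx->size; i++)
        sum += ctx->data[i];
    ctx->checksum = sum;
    return NULL;
}

/* Starts n workers with `bytes` of private data each and waits for them.
   Returns the number of workers that allocated their buffer successfully.
   The caller frees ctx[i].data afterwards. */
int run_workers(worker_ctx *ctx, int n, size_t bytes)
{
    assert(ctx != NULL && n > 0);

    pthread_t *tid = malloc((size_t)n * sizeof *tid);
    int started = 0, ok = 0;
    if (tid == NULL)
        return 0;

    for (int i = 0; i < n; i++) {
        ctx[i].data = NULL;
        ctx[i].size = bytes;
        ctx[i].checksum = 0;
        if (pthread_create(&tid[i], NULL, worker_main, &ctx[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        if (ctx[i].data != NULL)
            ok++;
    }
    free(tid);
    return ok;
}
```

The startup thread only creates the workers and never writes to their buffers, so it cannot cause the data to be placed on its own node.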
Certain OSs provide memory placement tools and APIs that also permit data migration. A worker thread can use these to migrate the data from the node where the startup thread performed the first touch to the node where the worker thread needs it. Migration has a cost, however, and is less efficient than using the correct data initialization scheme in the first place.
If it is not possible to modify the application to use a correct data initialization scheme or if data is truly being shared by the various worker
Assume that the data structure shared between the worker threads in this case is 16 KB in size. If the default policy of local allocation is used, then the entire 16 KB data structure resides on the node where the startup thread performs the first touch. However, using the policy of node interleaving, the
The tools and APIs that support explicit thread and memory placement mentioned in the previous sections can also be used by an application to apply the node interleaving policy to its memory. For additional details, refer to Section A.8 on page 46.
By default, the granularity of interleaving offered by the tools and APIs is usually the size of the virtual page supported by the hardware: 4 KB when the system is configured for normal pages (the default) and 2 MB when the system is configured for large pages. Therefore, node interleaving yields a benefit only if the data being accessed is significantly larger than the virtual page size.
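The effect of page granularity can be checked with a little arithmetic. The sketch below is an illustration only: it assumes that pages are interleaved round-robin starting at node 0 and that the buffer is page aligned, and the function names interleave_node and nodes_spanned are hypothetical rather than part of any OS API.

```c
#include <assert.h>
#include <stddef.h>

/* Node holding the page that contains byte `offset` of a page-aligned
   buffer, assuming round-robin interleaving that starts at node 0. */
int interleave_node(size_t offset, size_t page_size, int num_nodes)
{
    assert(page_size > 0 && num_nodes > 0);
    return (int)((offset / page_size) % (size_t)num_nodes);
}

/* Number of distinct nodes that hold some part of a buffer of `size`
   bytes under the same assumptions. */
int nodes_spanned(size_t size, size_t page_size, int num_nodes)
{
    assert(page_size > 0 && num_nodes > 0);
    size_t pages = (size + page_size - 1) / page_size;  /* round up */
    return pages < (size_t)num_nodes ? (int)pages : num_nodes;
}
```

Under these assumptions, the 16 KB data structure discussed earlier occupies four 4 KB pages, so nodes_spanned(16 * 1024, 4096, 4) evaluates to 4 and each of four nodes holds one page. With 2 MB pages the same structure fits in a single page, nodes_spanned(16 * 1024, 2 * 1024 * 1024, 4) evaluates to 1, and interleaving spreads nothing.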
If data is being accessed by three or more cores, then it is better to interleave data across the nodes that access the data than to leave it resident on a single node. We anticipate that using this rule of thumb could give a significant performance improvement. However, developers are advised to experiment with their applications to measure any performance change.
A good example of the use of node interleaving is observed with Spec JBB 2005 using Sun JVM
24 | Analysis and Recommendations | Chapter 3 |