AMD64 Manual: Analysis and Recommendations

Page 24

Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™

40555 Rev. 3.00 June 2006

ccNUMA Multiprocessor Systems

 

afterwards no longer needs the data structure, and if only one of the worker threads needs the data structure. In other words, the data structure is not truly shared between the worker threads.

It is best in this case to use a data initialization scheme that avoids incorrect data placement due to first touch. This is done by allowing each worker thread to first touch its own data or by explicitly pinning the data associated with each worker thread on the node where the worker thread runs.
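As an illustration, the per-thread first-touch scheme can be sketched in C with POSIX threads. This is a minimal sketch, not code from this manual: the function names (`parallel_first_touch`, `first_touch_worker`) are invented for the example, and it assumes an OS whose default local-allocation policy places a page on the node of the thread that first writes it.

```c
#include <pthread.h>
#include <string.h>

/* Describes the slice of the shared buffer one worker owns. */
typedef struct {
    char  *base;   /* start of this worker's slice */
    size_t len;    /* slice length in bytes */
} slice_t;

/* Each worker writes (first-touches) its own slice, so under a
   local-allocation policy those pages are placed on the node where
   this worker runs. */
static void *first_touch_worker(void *arg)
{
    slice_t *s = (slice_t *)arg;
    memset(s->base, 0, s->len);
    return NULL;
}

/* Split buf into nthreads slices and let each worker touch its own.
   Returns 0 on success, -1 on error (sketch-level error handling only). */
int parallel_first_touch(char *buf, size_t len, int nthreads)
{
    pthread_t tid[64];
    slice_t   sl[64];
    if (nthreads < 1 || nthreads > 64)
        return -1;
    size_t chunk = len / (size_t)nthreads;
    for (int i = 0; i < nthreads; i++) {
        sl[i].base = buf + (size_t)i * chunk;
        /* last slice absorbs any remainder */
        sl[i].len  = (i == nthreads - 1) ? len - (size_t)i * chunk : chunk;
        if (pthread_create(&tid[i], NULL, first_touch_worker, &sl[i]) != 0)
            return -1;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```

In a real application, each worker would also be pinned to its node with the affinity tools and APIs discussed in this document, so that the thread that touches the data is the same thread that later uses it.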

Certain OSs provide memory placement tools and APIs that also permit data migration. A worker thread can use these to migrate data from the node where the startup thread performed the first touch to the node where the worker thread needs it. Migration has a cost, however, and is less efficient than using the correct data initialization scheme in the first place.

If it is not possible to modify the application to use a correct data initialization scheme, or if data is truly shared by the various worker threads—as in a database application—then a technique called node interleaving can be used to improve performance. Node interleaving allows memory to be interleaved across any subset of nodes in the multiprocessor system. When the node interleaving policy is used, it overrides the default local allocation policy used by the OS on first touch.

Let us assume that the data structure shared between the worker threads in this case is 16 KB in size. If the default policy of local allocation is used, the entire 16-KB data structure resides on the node where the startup thread performs the first touch. Using the node interleaving policy instead, the 16-KB data structure can be interleaved on first touch such that the first 4 KB ends up on node 0, the next 4 KB on node 1, the next 4 KB on node 2, and so on. This assumes that enough physical memory is available on each node. Thus, instead of all memory being resident on a single node and making that node the bottleneck, memory is spread out across all nodes.
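The round-robin placement described above is easy to express as arithmetic. In the following sketch, the function name `node_for_offset` is invented for illustration—the OS, not the application, performs the actual placement—but it shows which node each page-sized chunk of an interleaved allocation lands on.

```c
#include <stddef.h>

/* Illustrative only: compute which node a given byte offset lands on
   when memory is interleaved across nnodes in page_size-sized chunks.
   Chunk k of the allocation goes to node (k mod nnodes). */
int node_for_offset(size_t offset, size_t page_size, int nnodes)
{
    return (int)((offset / page_size) % (size_t)nnodes);
}
```

For the 16-KB structure interleaved across a four-node system with 4-KB pages, offsets 0, 4096, 8192, and 12288 map to nodes 0, 1, 2, and 3, respectively.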

The tools and APIs that support explicit thread and memory placement mentioned in the previous sections can also be used by an application to use the node interleaving policy for its memory. For additional details refer to Section A.8 on page 46.

By default, the granularity of interleaving offered by the tools/APIs is usually the size of the virtual page supported by the hardware: 4 KB when the system is configured for normal pages (the default) and 2 MB when the system is configured for large pages. Therefore, any benefit from node interleaving is obtained only if the data being accessed is significantly larger than the virtual page size.
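As a rough check of this guideline, consider the following hypothetical predicate (the function name and threshold are assumptions for illustration, not a rule from this manual): round-robin interleaving can only place memory on every node if the data spans at least one page per node, so anything smaller cannot be spread at all.

```c
#include <stddef.h>

/* Heuristic sketch: interleaving across nnodes can only reach every
   node if the data covers at least nnodes pages. This is a necessary
   condition, not a guarantee of a speedup; real benefit typically
   requires data much larger than this minimum. */
int may_benefit_from_interleaving(size_t data_size, size_t page_size, int nnodes)
{
    return nnodes > 1 && data_size >= page_size * (size_t)nnodes;
}
```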

If data is being accessed by three or more cores, then it is better to interleave data across the nodes that access the data than to leave it resident on a single node. We anticipate that using this rule of thumb could give a significant performance improvement. However, developers are advised to experiment with their applications to measure any performance change.

A good example of the use of node interleaving is observed with SPECjbb2005 using Sun JVM 1.5.0_04-GA. Using node interleaving improved the peak throughput score reported by SPECjbb2005 by 8%. We observe that, as this benchmark starts with a single thread and then ramps up to eight threads, all threads end up accessing memory resident on a single node by virtue of first touch.


AMD64 specifications

AMD64 is a 64-bit architecture developed by Advanced Micro Devices (AMD) as an extension of the x86 architecture. Introduced in the early 2000s, it aimed to offer enhanced performance and capabilities to power modern computing systems. One of the main features of AMD64 is its ability to address a significantly larger amount of memory than its 32-bit predecessors. While 32-bit x86 was limited to a 4-GB address space, AMD64 can theoretically support up to 16 exabytes of memory, making it well suited to applications with large datasets, such as scientific computing and complex simulations.

Another key characteristic of AMD64 is its support for backward compatibility. This means that it can run existing 32-bit applications seamlessly, allowing users to upgrade their hardware without losing access to their existing software libraries. This backward compatibility is achieved through a mode known as Compatibility Mode, enabling users to benefit from both newer 64-bit applications and older 32-bit applications.

AMD64 also incorporates several advanced technologies to optimize performance. One such technology is the support for multiple cores and simultaneous multithreading (SMT). This allows processors to handle multiple threads concurrently, improving overall performance, especially in multi-tasking and multi-threaded applications. With the rise of multi-core processors, AMD64 has gained traction in both consumer and enterprise markets, providing users with an efficient computing experience.

Additionally, modern AMD64 processors support Advanced Vector Extensions (AVX), which enhance the capability of processors to perform single-instruction, multiple-data (SIMD) operations. This is particularly beneficial for tasks such as video encoding, scientific simulations, and cryptography, allowing these workloads to execute much faster and thereby increasing overall throughput.

Security features are also integrated into the AMD64 architecture. Technologies such as AMD Secure Encrypted Virtualization (SEV) and Secure Memory Encryption (SME) help protect sensitive data and provide an enhanced security environment for virtualized systems.

In summary, AMD64 is a powerful and versatile architecture that extends the capabilities of x86, offering enhanced memory addressing, backward compatibility, multi-core processing, vector extensions, and robust security features. These innovations have positioned AMD as a strong competitor in the computing landscape, catering to the demands of modern users and applications. The continuous evolution of AMD64 technology demonstrates AMD's commitment to pushing the boundaries of computing performance and efficiency.