40555 Rev. 3.00 June 2006

Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems

Chapter 4 Conclusions

The single most important recommendation for most applications is to keep data local to the node where it is accessed. As long as a thread initializes the data it needs, in other words writes to it for the first time, a ccNUMA-aware OS will typically keep that data local to the node where the thread runs. By applying this local-allocation-on-first-touch policy for data placement, a ccNUMA-aware OS makes the task of data placement transparent and easy for most developers.
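As an illustration, the following is a minimal sketch of first-touch-friendly initialization using OpenMP on Linux; the array size and the work in the compute loop are placeholders, and it assumes the OS applies local allocation on first touch. Because both loops use the same static schedule, each thread touches and later computes on the same pages, so those pages stay local to its node.

#include <stdlib.h>

#define N (64 * 1024 * 1024)   /* illustrative array size */

int main(void)
{
    double *data = malloc(N * sizeof(double));
    if (data == NULL)
        return 1;

    /* Parallel first touch: each thread writes the pages it will use
     * later, so a first-touch ccNUMA OS places those pages on the
     * node where that thread runs. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        data[i] = 0.0;

    /* Compute phase: the same static schedule means each thread
     * accesses the pages it first touched. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        data[i] = data[i] * 2.0 + 1.0;

    free(data);
    return 0;
}

Built with, for example, gcc -fopenmp, this pattern keeps each thread's working set on its own node without any explicit placement calls.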

In some cases, if an application shows symptoms that its threads are being moved away from their data, it may be useful to pin threads explicitly to specific nodes. Several ccNUMA-aware OSs offer tools and APIs to influence thread placement. Typically, an OS scheduler uses load-balancing schemes to decide where to place threads; using these tools or APIs overrides the scheduler and hands control of thread placement to the developer. The developer should schedule threads sensibly with these tools or APIs by adhering to the following guidelines (a sketch of explicit pinning appears after the guidelines):

When scheduling threads that mostly access independent data on an idle dual-core AMD multiprocessor system, first schedule one thread onto an idle core of each node until every node is occupied, and only then fill the remaining idle core of each node. In other words, schedule in node-major order.

When scheduling multiple threads that mostly share data with one another on an idle dual-core AMD multiprocessor system, schedule threads onto both cores of an idle node first, then move on to the next idle node, and so on. In other words, schedule in core-major order.
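The following is a minimal Linux sketch of node-major pinning for independent-data threads, using libnuma's numa_run_on_node(); the worker function is a hypothetical placeholder and error handling is abbreviated. It creates one thread per node and restricts each thread to the CPUs of its node, overriding the scheduler as described above.

#include <numa.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical worker; the application's real work goes here. */
static void *worker(void *arg)
{
    int node = (int)(long)arg;

    /* Restrict this thread to the CPUs of the given node, overriding
     * the OS scheduler's load-balancing decisions. */
    if (numa_run_on_node(node) != 0)
        perror("numa_run_on_node");

    /* ... thread-local work on data resident on this node ... */
    return NULL;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    int nodes = numa_max_node() + 1;
    pthread_t tid[nodes];

    /* Node-major placement: one thread per node, as recommended for
     * threads that mostly access independent data. */
    for (int n = 0; n < nodes; n++)
        pthread_create(&tid[n], NULL, worker, (void *)(long)n);
    for (int n = 0; n < nodes; n++)
        pthread_join(tid[n], NULL);

    return 0;
}

Such a program would be linked with -lnuma and -lpthread.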

By default, a ccNUMA-aware OS uses the local-allocation-on-first-touch policy for data placement. While usually effective, this policy can become suboptimal if a thread first touches data on one node that it subsequently no longer needs, but that some other thread later accesses from a different node. It is best to change the data-initialization scheme so that each thread initializes the data it needs and does not rely on any other thread to do the initialization. Several ccNUMA-aware OSs offer tools and APIs to influence data placement. Using these tools or APIs overrides the default local-allocation-on-first-touch policy and hands control of data placement to the developer.
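Where re-initialization is impractical, an explicit placement API can be used instead. The following is a minimal sketch using libnuma's numa_alloc_onnode() on Linux; the buffer size and target node are illustrative. The allocation lands on the named node regardless of which thread touches it first.

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    size_t size = 64UL * 1024 * 1024;   /* illustrative buffer size */

    /* Place the buffer on node 0 explicitly, overriding the default
     * local-allocation-on-first-touch policy. */
    double *buf = numa_alloc_onnode(size, 0);
    if (buf == NULL) {
        perror("numa_alloc_onnode");
        return 1;
    }

    /* ... access buf from threads running on node 0 ... */

    numa_free(buf, size);
    return 0;
}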

If it is not possible to change the data-initialization scheme, or if the data is truly shared by threads running on different nodes, then a technique called node interleaving of memory can be used. Node interleaving is recommended when the data resides on a single node and is accessed by three or more cores. The interleaving should be performed across the nodes from which the data is accessed, and should only be used when the data is significantly larger than 4 KB (when the system is configured for normal pages, which is the default) or 2 MB (when the system is configured for large pages). Developers are advised to experiment with their applications to gauge the performance change due to node interleaving. For additional details on the tools and APIs offered by various OSs for node interleaving, refer to Section A.8 on page 46.
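The following is a minimal Linux sketch of node interleaving using libnuma's numa_alloc_interleaved(); the buffer size is illustrative and well above the 4 KB normal-page threshold mentioned above. Pages are distributed round-robin across the allowed nodes so that no single node's memory becomes a hotspot; numa_alloc_interleaved_subset() can restrict interleaving to just the nodes that actually access the data.

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    size_t size = 256UL * 1024 * 1024;   /* much larger than one page */

    /* Interleave the buffer's pages round-robin across all allowed
     * nodes, spreading accesses from many cores over several memory
     * controllers instead of concentrating them on one node. */
    double *shared = numa_alloc_interleaved(size);
    if (shared == NULL) {
        perror("numa_alloc_interleaved");
        return 1;
    }

    /* ... threads on different nodes access 'shared' ... */

    numa_free(shared, size);
    return 0;
}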


AMD64 Specifications

AMD64 is a 64-bit architecture developed by Advanced Micro Devices (AMD) as an extension of the x86 architecture. Introduced in the early 2000s, it aimed to offer enhanced performance and capabilities to power modern computing systems. One of the main features of AMD64 is its ability to address a significantly larger amount of memory than its 32-bit predecessors. While 32-bit x86 is limited to a 4 GB address space, AMD64 defines a 64-bit virtual address space that can in theory reach 16 exabytes (2^64 bytes), although implementations use a subset of those address bits. This makes it well suited to applications with large datasets, such as scientific computing and complex simulations.

Another key characteristic of AMD64 is its support for backward compatibility: it can run existing 32-bit applications unmodified, allowing users to upgrade their hardware without losing access to their existing software libraries. This is achieved through a sub-mode of long mode known as compatibility mode, which lets a 64-bit operating system run both newer 64-bit applications and older 32-bit applications side by side.

Modern AMD64 processors also incorporate several technologies to optimize performance, among them multiple cores and simultaneous multithreading (SMT). These allow a processor to execute multiple threads concurrently, improving overall performance, especially for multitasking and multithreaded applications. With the rise of multi-core processors, AMD64 has gained traction in both consumer and enterprise markets, providing users with an efficient computing experience.

Additionally, modern AMD64 processors support Advanced Vector Extensions (AVX), a later addition to the instruction set that enhances the processor's ability to perform single instruction, multiple data (SIMD) operations. This is particularly beneficial for tasks such as video encoding, scientific simulation, and cryptography, where operating on many data elements per instruction increases overall throughput.
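As a concrete illustration of SIMD, the following is a minimal C sketch using AVX intrinsics; the function name and the assumption that n is a multiple of 8 are hypothetical simplifications. Each loop iteration adds eight single-precision floats at once in a 256-bit register.

#include <immintrin.h>

/* Add two float arrays eight elements at a time using 256-bit AVX
 * registers; assumes n is a multiple of 8 for brevity. */
void add_arrays(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);   /* load 8 floats from a */
        __m256 vb = _mm256_loadu_ps(&b[i]);   /* load 8 floats from b */
        _mm256_storeu_ps(&out[i], _mm256_add_ps(va, vb));
    }
}

Built with, for example, gcc -mavx, this loop processes eight elements per instruction where a scalar loop would process one.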

Security features are also integrated into modern implementations of the AMD64 architecture. Technologies such as Secure Memory Encryption (SME) and Secure Encrypted Virtualization (SEV) help protect sensitive data and provide a hardened environment for virtualized systems.

In summary, AMD64 is a powerful and versatile architecture that extends the capabilities of x86, offering enhanced memory addressing, backward compatibility, multi-core processing, vector extensions, and robust security features. These innovations have positioned AMD as a strong competitor in the computing landscape, catering to the demands of modern users and applications. The continuous evolution of AMD64 technology demonstrates AMD's commitment to pushing the boundaries of computing performance and efficiency.