40555 Rev. 3.00 June 2006

Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems

Chapter 3: Analysis and Recommendations

A ccNUMA-aware OS keeps data local to the node where the first touch occurs, as long as enough physical memory is available on that node. If sufficient physical memory is not available on that node, the OS falls back on its own placement heuristics to decide where to put the data; these heuristics vary from one OS to another.

Once data has been placed on a node by first touch, it normally resides on that node for its lifetime. However, the OS scheduler can migrate the thread that first touched the data from one core to another, even to a core on a different node. The OS may do this for load balancing [3].

This migration moves the thread farther from its data. Some schedulers try to bring the thread back to a core on the node where its data resides in local memory, but this is never guaranteed. Furthermore, before being moved back, the thread could first touch additional data on the node to which it was moved. This is a difficult problem for the OS to resolve, since it has no prior knowledge of how long the thread will run and, hence, whether migrating it back is worthwhile.

If analysis of an application shows that the scheduler is moving threads away from their associated memory, it is typically useful to set thread placement explicitly. By explicitly pinning a thread to a node, the application tells the OS to keep the thread on that node and, thus, keeps the data accessed by the thread local to it by virtue of first touch.

The performance improvement obtained from explicit thread placement varies with the application: whether it is multithreaded, whether it needs more memory than is available on a node, whether its threads are in fact being moved away from their data, and so on.

In some cases, where threads are scheduled from the outset on a core that is remote from their data, it might be useful to control data placement explicitly. This is discussed in detail in Section 3.2.2.

The previously discussed tools and APIs for explicitly controlling thread placement can also be used to control data placement. For additional details on thread and memory placement tools and APIs in various operating systems, refer to Section A.7 on page 44.

3.2.2 Data Placement Techniques to Alleviate Unnecessary Data Sharing Between Nodes Due to First Touch

When data is shared between threads running on different nodes, the OS's default policy of local allocation by first touch can become suboptimal.

For example, a multithreaded application may have a startup thread that sets up the environment, allocates and initializes a data structure, and then forks off worker threads. Under the default local allocation policy, the data structure is placed in the physical memory of the node where the startup thread performed the first touch. The forked worker threads are spread by the scheduler across all nodes and their cores for load balancing. Each worker thread then accesses the data structure remotely from the memory of the node where the first touch occurred. This can lead to significant memory and HyperTransport traffic in the system and makes the node where the data resides a bottleneck. This situation is especially bad for performance if the startup thread only does the initialization and

