Data Locality Considerations, Multiple Threads-Shared Data

Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™	40555 Rev. 3.00 June 2006
ccNUMA Multiprocessor Systems

3.1.2Multiple Threads-Shared Data

When scheduling multiple threads that share data on an idle system, it is preferable to schedule the threads on both cores of an idle node first, then on both cores of the the next idle node, and so on. In other words, schedule using core major order first followed by node major order.

For example, when scheduling threads that share data on a dual-core Quartet system, AMD recommends using the following order:

•Core 0 and core 1 on node 0 in any order

•Core 0 and core 1 on node 1 in any order

•Core 0 and core 1 on node 2 in any order

•Core 0 and core 1 on node 3 in any order

3.1.3Scheduling on a Non-Idle System

Scheduling multiple threads for an application optimally on a non-idle system is a more difficult task. It requires that the application make global holistic decisions about machine resources, coordinate itself with other applications already running, and balance decisions between them. In such cases, it is better to rely on the OS to do the appropriate load balancing [2].

In general, most developers will achieve good performance by relying on the ccNUMA-aware OS to make the right scheduling decisions on idle and non-idle systems. For additional details on ccNUMA scheduler support in various operating systems, refer to Section A.6 on page 43.

In addition to the scheduler, several NUMA-aware OSs provide tools and application programming interfaces (APIs) that allow the developer to explicitly set thread placement to a certain core or node. Using these tools or APIs overrides the scheduler and hands over control for thread placement to the developer, who should use the previously mentioned techniques to assure reasonable scheduling.

For additional details on the tools and API libraries supported in various OSs, refer to Section A.7 on page 44.

3.2Data Locality Considerations

It is best to keep data local to the node from which it is being accessed. Accessing data remotely is slower than accessing data locally. The further the hop distance to the data, the greater the cost of accessing remote memory. For most memory-latency sensitive applications, keeping data local is the single most important recommendation to consider.

As explained in Section 2.1 on page page 13, if a thread is running and accessing data on the same node, it is considered as a local access. If a thread is running on one node but accessing data resident on a different node, it is considered as a remote access. If the node where the thread is running and the node where the data is resident are directly connected to each other, it is considered as a 1 hop access

Analysis and Recommendations

Chapter 3

64 specifications

AMD64 is a 64-bit architecture developed by Advanced Micro Devices (AMD) as an extension of the x86 architecture. Introduced in the early 2000s, it aimed to offer enhanced performance and capabilities to powering modern computing systems. One of the main features of AMD64 is its ability to address a significantly larger amount of memory compared to its 32-bit predecessors. While the old x86 architecture was limited to 4 GB of RAM, AMD64 can theoretically support up to 16 exabytes of memory, making it ideal for applications requiring large datasets, such as scientific computing and complex simulations.

Another key characteristic of AMD64 is its support for backward compatibility. This means that it can run existing 32-bit applications seamlessly, allowing users to upgrade their hardware without losing access to their existing software libraries. This backward compatibility is achieved through a mode known as Compatibility Mode, enabling users to benefit from both newer 64-bit applications and older 32-bit applications.

AMD64 also incorporates several advanced technologies to optimize performance. One such technology is the support for multiple cores and simultaneous multithreading (SMT). This allows processors to handle multiple threads concurrently, improving overall performance, especially in multi-tasking and multi-threaded applications. With the rise of multi-core processors, AMD64 has gained traction in both consumer and enterprise markets, providing users with an efficient computing experience.

Additionally, AMD64 supports advanced vector extensions (AVX), which enhance the capability of processors to perform single instruction, multiple data (SIMD) operations. This is particularly beneficial for tasks such as video encoding, scientific simulations, and cryptography, allowing these processes to be executed much faster, thereby increasing overall throughput.

Security features are also integrated within AMD64 architecture. Technologies like AMD Secure Execution and Secure Memory Encryption help protect sensitive data and provide an enhanced security environment for virtualized systems.

In summary, AMD64 is a powerful and versatile architecture that extends the capabilities of x86, offering enhanced memory addressing, backward compatibility, multi-core processing, vector extensions, and robust security features. These innovations have positioned AMD as a strong competitor in the computing landscape, catering to the demands of modern users and applications. The continuous evolution of AMD64 technology demonstrates AMD's commitment to pushing the boundaries of computing performance and efficiency.

Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™

3.1.2Multiple Threads-Shared Data

3.1.3Scheduling on a Non-Idle System

3.2Data Locality Considerations

64 specifications