40555 Rev. 3.00 June 2006
Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems

Chapter 1 Introduction

The AMD Athlon™ 64 and AMD Opteron™ family of single-core and dual-core multiprocessor systems is based on the cache-coherent Non-Uniform Memory Access (ccNUMA) architecture. In this architecture, each processor has access to its own low-latency local memory (through the processor's on-die memory controller), as well as to higher-latency remote memory through the on-die memory controllers of the other processors in the system. At the same time, the ccNUMA architecture is designed to maintain cache coherence across the entire shared memory space. The high-performance coherent HyperTransport™ technology links between processors carry both the remote memory accesses and the cache coherence traffic.

In traditional symmetric multiprocessing (SMP) systems, the various processors share a single memory controller. This single memory connection can become a performance bottleneck when all processors access memory at once. At the same time, the SMP architecture does not scale well into larger systems with a greater number of processors. The AMD ccNUMA architecture is designed to overcome these inherent SMP performance bottlenecks. It is a mature architecture that is designed to extract greater performance potential from multiprocessor systems.

As developers deploy more demanding workloads on these multiprocessor systems, common performance questions arise: Where should threads or processes be scheduled (thread or process placement)? Where should memory be allocated (memory placement)? The underlying operating system (OS), tuned for AMD Athlon 64 and AMD Opteron multiprocessor ccNUMA systems, makes these performance decisions transparent and easy.
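As a minimal sketch of explicit process placement, the following pins the calling process to a single logical CPU. It assumes a Linux system, where Python exposes `os.sched_getaffinity` and `os.sched_setaffinity`; the helper name `pin_to_cpu` is hypothetical, and mapping CPU indices to ccNUMA nodes is not shown.

```python
import os


def pin_to_cpu(cpu=None):
    """Pin the calling process to one logical CPU and return its index.

    Returns None on platforms that lack os.sched_setaffinity (non-Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)  # CPUs this process may currently use
    if cpu is None:
        cpu = min(allowed)  # illustrative default: lowest allowed CPU
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})  # pid 0 means the calling process
    return cpu


if __name__ == "__main__":
    print("pinned to CPU:", pin_to_cpu())
```

Pinning only controls where the process runs. Where its memory lands is decided separately, by the memory placement policy of the operating system.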

Advanced developers, however, should be aware of the more advanced tools and techniques available for performance tuning. In addition to recommending mechanisms provided by the OS for explicit thread (or process) and memory placement, this application note explores advanced techniques such as node interleaving of memory to boost performance. This document also delves into the characterization of an AMD ccNUMA multiprocessor system, providing advanced developers with an understanding of the fundamentals necessary to enhance the performance of synthetic and real applications and to develop advanced tools.
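The idea behind node interleaving can be sketched as a round-robin mapping of memory pages to nodes. This is an illustration only: the 4096-byte page size, the page-granular round-robin policy, and the function names are assumptions, and the actual granularity and mechanism depend on the OS or firmware.

```python
PAGE_SIZE = 4096  # bytes; assumed page size for illustration


def interleaved_node(address, num_nodes, page_size=PAGE_SIZE):
    """Return the node that backs `address` under round-robin page interleaving."""
    return (address // page_size) % num_nodes


def pages_per_node(buffer_bytes, num_nodes, page_size=PAGE_SIZE):
    """Count how many pages of a buffer starting at address 0 land on each node."""
    num_pages = -(-buffer_bytes // page_size)  # ceiling division
    counts = [0] * num_nodes
    for page in range(num_pages):
        counts[page % num_nodes] += 1
    return counts


if __name__ == "__main__":
    # A 10-page buffer spread over 4 nodes.
    print(pages_per_node(10 * PAGE_SIZE, 4))
```

Spreading pages this way evens out the load on the memory controllers, at the cost of making a share of every thread's accesses remote.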

In general, applications can be memory latency sensitive or memory bandwidth sensitive; both classes are important for performance tuning. In a multiprocessor system, in addition to memory latency and memory bandwidth, other factors influence performance:

• the latency of remote memory access (hop latency)

• the latency of maintaining cache coherence (probe latency)

• the bandwidth of the HyperTransport interconnect links

• the lengths of various buffer queues in the system
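To show how hop latency enters the picture, here is a deliberately simplified model of average memory latency. It is a toy sketch, not a model taken from this document: it assumes each remote access costs the local latency plus a fixed penalty per hop, and it ignores probe latency, link bandwidth, and queuing. The function name and all numeric values are placeholders, not measurements.

```python
def average_latency_ns(local_ns, hop_ns, hops, remote_fraction):
    """Average access latency when a fraction of accesses go to remote memory.

    Assumes a remote access costs local_ns + hops * hop_ns (a simplification).
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    remote_ns = local_ns + hops * hop_ns
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


if __name__ == "__main__":
    # Placeholder inputs chosen only to exercise the formula.
    print(average_latency_ns(local_ns=100, hop_ns=50, hops=1, remote_fraction=0.5))
```

Even in this reduced form, the model shows why thread and memory placement matter: the average degrades linearly with the share of remote accesses and with the hop count.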

The empirical analysis presented in this document is based upon data provided by running a multi-threaded synthetic test. While this test is neither a pure memory latency test nor a pure memory

64 specifications

AMD64 is a 64-bit architecture developed by Advanced Micro Devices (AMD) as an extension of the x86 architecture. Introduced in the early 2000s, it was designed to offer the performance and capabilities needed to power modern computing systems. One of its main features is the ability to address far more memory than its 32-bit predecessors. While 32-bit x86 addresses are limited to a 4 GB address space, a 64-bit address can theoretically span up to 16 exabytes, although actual implementations expose fewer address bits. This makes AMD64 well suited to applications that work on large datasets, such as scientific computing and complex simulations.
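The 4 GB and 16 exabyte figures follow directly from the address width: an n-bit address can name 2^n bytes. A short check, using binary units (1 GiB = 2^30 bytes, 1 EiB = 2^60 bytes); the function name is chosen for illustration:

```python
GIB = 2**30  # bytes in one gibibyte
EIB = 2**60  # bytes in one exbibyte


def addressable_bytes(address_bits):
    """Number of distinct byte addresses expressible in `address_bits` bits."""
    return 2**address_bits


if __name__ == "__main__":
    print(addressable_bytes(32) // GIB, "GiB with 32-bit addresses")  # 4
    print(addressable_bytes(64) // EIB, "EiB with 64-bit addresses")  # 16
```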

Another key characteristic of AMD64 is backward compatibility. It can run existing 32-bit applications without modification, allowing users to upgrade their hardware without losing access to their existing software libraries. Under a 64-bit operating system this is achieved through compatibility mode, in which 32-bit applications run alongside newer 64-bit applications.

Processors that implement AMD64 also incorporate several technologies to improve performance, among them multiple cores and, on later designs, simultaneous multithreading (SMT). These allow a processor to execute multiple threads concurrently, improving overall performance, especially for multi-tasking and multi-threaded applications. With the rise of multi-core processors, AMD64 has gained traction in both consumer and enterprise markets.

Additionally, later AMD64 processors support Advanced Vector Extensions (AVX), which widen the registers used for single instruction, multiple data (SIMD) operations. This is particularly beneficial for tasks such as video encoding, scientific simulations, and cryptography, where processing several data elements per instruction increases overall throughput.

Security features are also integrated into recent AMD64 processors. Technologies such as Secure Memory Encryption (SME) and Secure Encrypted Virtualization (SEV) help protect sensitive data in memory and provide a hardened environment for virtualized systems.

In summary, AMD64 is a powerful and versatile architecture that extends the capabilities of x86, offering enhanced memory addressing, backward compatibility, multi-core processing, vector extensions, and robust security features. These innovations have positioned AMD as a strong competitor in the computing landscape, catering to the demands of modern users and applications. The continuous evolution of AMD64 technology demonstrates AMD's commitment to pushing the boundaries of computing performance and efficiency.