Non-uniform Memory Access Systems Research Articles

This paper introduces two novel algorithms for thread migration, CIMAR (Core-aware IMAR, where IMAR stands for Interchange and Migration Algorithm with performance Record) and NIMAR (Node-aware IMAR), and a new algorithm for the migration of memory pages, LMMA (Latency-based Memory pages Migration Algorithm), in the context of Non-Uniform Memory Access (NUMA) systems. Such systems have complex memory hierarchies that make it challenging to extract the best possible performance, and thread and memory mapping play a critical role. The presented algorithms gather and process the information provided by hardware counters to decide which migrations to perform, trying to find the optimal mapping. They have been implemented as a user-space tool that aims to improve system performance, particularly in, but not restricted to, scenarios where multiple programs with different characteristics are running. This approach has the advantage of not requiring any modification to the target programs or the Linux kernel while keeping a low overhead. Two different benchmark suites have been used to validate our algorithms: the NAS Parallel Benchmarks, mainly devoted to computational routines, and the LevelDB database benchmark, focused on read–write operations. These benchmarks allow us to illustrate the influence of our proposal on these two important types of codes. Note that those codes are state-of-the-art implementations of the routines, so few improvements could initially be expected. Experiments have been designed and conducted to emulate three different scenarios: a single program running in the system with full resources, an interactive server where multiple programs run concurrently with varying availability of resources, and a queue of tasks where granted resources are limited. The proposed algorithms have produced significant benefits, especially in systems with higher latency penalties for remote accesses.
When more than one benchmark is executed simultaneously, performance improves, with execution times reduced by up to 60%. In this kind of situation, the behaviour of the system is more critical, and the NUMA topology plays a more relevant role. Even in the worst case, when isolated benchmarks are executed using the whole system, that is, just one task at a time, performance is not degraded.
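
The abstract does not spell out how CIMAR, NIMAR, or LMMA choose their migrations, so the following is only a minimal sketch of the general idea of counter-driven thread placement, not the authors' algorithms. It assumes that a sampling step (not shown) has already attributed each thread's memory accesses to NUMA nodes. The names `preferred_node`, `plan_migrations`, and `min_remote_ratio` are hypothetical and introduced purely for illustration.

```python
from typing import Dict, List, Tuple


def preferred_node(accesses: Dict[int, int]) -> int:
    """Return the node with the most sampled accesses (lowest node id on ties)."""
    return min(accesses, key=lambda node: (-accesses[node], node))


def plan_migrations(
    current_node: Dict[int, int],
    samples: Dict[int, Dict[int, int]],
    min_remote_ratio: float = 0.5,  # hypothetical threshold, not from the paper
) -> List[Tuple[int, int, int]]:
    """List (thread id, from node, to node) for threads whose accesses are mostly remote.

    current_node maps thread id -> node the thread currently runs on.
    samples maps thread id -> {node id: sampled memory accesses served by that node}.
    """
    plan = []
    for tid in sorted(samples):
        accesses = samples[tid]
        total = sum(accesses.values())
        if total == 0:
            continue  # no samples for this thread, so no basis for a decision
        here = current_node[tid]
        remote_ratio = 1.0 - accesses.get(here, 0) / total
        target = preferred_node(accesses)
        if target != here and remote_ratio >= min_remote_ratio:
            plan.append((tid, here, target))
    return plan


if __name__ == "__main__":
    nodes = {1: 0, 2: 1, 3: 0}
    counts = {1: {0: 10, 1: 90}, 2: {0: 5, 1: 95}, 3: {0: 60, 1: 40}}
    print(plan_migrations(nodes, counts))
```

In this toy input only thread 1 is proposed for migration, because it runs on node 0 while most of its sampled accesses go to node 1. A real tool would also have to weigh the cost of the migration itself and the load on the destination node.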

The calculation of macroscopic neutron cross-sections is a fundamental part of the continuous-energy Monte Carlo (MC) neutron transport algorithm. MC simulations of full nuclear reactor cores are computationally expensive, and their long time to solution makes high-accuracy simulations impractical for most routine reactor analysis tasks. Preparing MC simulation algorithms for next-generation supercomputers is therefore extremely important, as improvements in computational performance and efficiency translate directly into improvements in achievable simulation accuracy. Due to the stochastic nature of the MC algorithm, cross-section data tables are accessed in a highly randomized manner, resulting in frequent cache misses and latency-bound memory accesses. Furthermore, contemporary and next-generation non-uniform memory access (NUMA) computer architectures, featuring very high latencies and less cache space per core, will exacerbate this behaviour. The absence of a topology-aware allocation strategy in existing high-performance computing (HPC) programming models is a major source of performance problems in NUMA systems. Thus, to improve the performance of the MC simulation algorithm, we propose topology-aware data allocation strategies that allow full control over the location of data structures within a memory hierarchy. A new memory management library, known as AML, has recently been created to facilitate this mapping. To evaluate the usefulness of AML in the context of MC reactor simulations, we have converted two existing MC transport cross-section lookup “proxy-applications” (XSBench and RSBench) to use the AML allocation library. In this study, we use these proxy-applications to test several continuous-energy cross-section data lookup strategies (the nuclide grid, unionized grid, logarithmic hash grid, and multipole methods) with a number of AML allocation schemes on a variety of node architectures.
We find that the AML library speeds up cross-section lookups by up to 2x on current-generation hardware (e.g., a dual-socket Skylake-based NUMA system) compared with naive allocation. These exciting results also show a path forward for efficient performance on next-generation exascale supercomputer designs that feature even more complex NUMA memory hierarchies.
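
To make the access pattern concrete, here is a minimal sketch of a nuclide-grid style lookup. It is not code from XSBench, RSBench, or AML. It assumes each nuclide stores its own sorted energy grid with one cross-section value per grid point, uses a binary search followed by linear interpolation, and sums the results weighted by number density to obtain the macroscopic value. The function names and the toy data are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence, Tuple


def micro_xs(energy_grid: Sequence[float], xs_values: Sequence[float], energy: float) -> float:
    """Linearly interpolate one nuclide's microscopic cross-section at `energy`.

    energy_grid must be sorted ascending and hold at least two points.
    Energies outside the grid are extrapolated from the nearest interval.
    """
    # Binary search for the interval [grid[i], grid[i + 1]] containing the energy,
    # clamped so that i and i + 1 are always valid indices.
    i = bisect_right(energy_grid, energy) - 1
    i = max(0, min(i, len(energy_grid) - 2))
    e0, e1 = energy_grid[i], energy_grid[i + 1]
    fraction = (energy - e0) / (e1 - e0)
    return xs_values[i] + fraction * (xs_values[i + 1] - xs_values[i])


def macro_xs(
    nuclides: Sequence[Tuple[Sequence[float], Sequence[float]]],
    densities: Sequence[float],
    energy: float,
) -> float:
    """Sum density-weighted microscopic cross-sections over all nuclides in a material."""
    return sum(
        density * micro_xs(grid, values, energy)
        for density, (grid, values) in zip(densities, nuclides)
    )


if __name__ == "__main__":
    # Toy data: two nuclides with different energy grids (arbitrary units).
    material = [([1.0, 2.0, 4.0], [10.0, 20.0, 40.0]), ([1.0, 4.0], [4.0, 1.0])]
    print(macro_xs(material, [0.5, 2.0], 3.0))
```

Each lookup performs one independent search per nuclide at a randomly sampled energy, so consecutive lookups touch unrelated parts of the tables. That is the latency-bound pattern the abstract describes, and it is why the placement of these tables in the memory hierarchy matters.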

Articles published on Non-uniform Memory Access Systems

Improving the accessibility of NUMA‐aware C++ application development based on the PGASUS framework

CIMAR, NIMAR, and LMMA: Novel algorithms for thread and memory migrations in user space on NUMA systems using hardware counters

NUMA-AWARE DATA MANAGEMENT FOR NEUTRON CROSS SECTION DATA IN CONTINUOUS ENERGY MONTE CARLO NEUTRON TRANSPORT SIMULATION

A Performance-Stable NUMA Management Scheme for Linux-Based HPC Systems

Dynamic concurrency throttling on NUMA systems and data migration impacts

Online MPI Process Mapping for Coordinating Locality and Memory Congestion on NUMA Systems

DeLoc: A Locality and Memory-Congestion-Aware Task Mapping Method for Modern NUMA Systems

Memory‐aware kernel mechanism and policies for improving internode load balancing on NUMA systems

LG-RAM: Load-aware global resource affinity management for virtualized multicore systems

Analysis of Memory System of Tiled Many-Core Processors

Extending NUMA-BTLP Algorithm with Thread Mapping Based on a Communication Tree

A performance comparison of data and memory allocation strategies for sequence aligners on NUMA architectures

Performance-Monitoring-Based Traffic-Aware Virtual Machine Deployment on NUMA Systems

Scalable Adaptive NUMA-Aware Lock

History-Based Arbitration for Fairness in Processor-Interconnect of NUMA Servers

Evaluation of Performance Unfairness in NUMA System Architecture

Optimizing the TPC-C Benchmark Program on Large-Scale Systems

Supporting NUMA-Aware I/O in Virtual Machines

Scalable adaptive NUMA-aware lock
