Non-uniform Memory Access Research Articles

Modern HPC platforms have become highly heterogeneous owing to the tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays), which lets them pursue the dominant objectives of performance and energy efficiency. As a result of this integration, processing elements contend for shared on-chip resources such as the Last Level Cache (LLC) and the interconnect, and for shared nodal resources such as DRAM and PCI-E links. The resulting severe resource contention and Non-Uniform Memory Access (NUMA) pose serious challenges to model and algorithm developers. Moreover, the accelerators have limited main memory compared to the multicore CPU host and are connected to it via limited-bandwidth PCI-E links, so they require support for efficient out-of-card execution. In summary, these complexities (resource contention, NUMA, accelerator-specific limitations, etc.) introduce new challenges to optimizing data-parallel applications on these platforms for performance. Because of them, the performance profiles of data-parallel applications executing on these platforms are not smooth and deviate significantly from the shapes that allowed state-of-the-art load-balancing algorithms to find optimal solutions. In this paper, we formulate the problem of optimizing data-parallel applications on modern heterogeneous HPC platforms for performance. We then propose a new model-based data partitioning algorithm, which minimizes the execution time of computations in the parallel execution of the application. The algorithm takes as input a set of $p$ discrete speed functions corresponding to the $p$ available heterogeneous processors and makes no assumptions about the shapes of these functions. We prove the correctness of the algorithm and its complexity of $O(m^3 \times p^3)$, where $m$ is the cardinality of the input discrete speed functions. We experimentally demonstrate the optimality and efficiency of our algorithm using two data-parallel applications, matrix multiplication and fast Fourier transform, on a heterogeneous cluster of nodes where each node contains an Intel multicore Haswell CPU, an Nvidia K40c GPU, and an Intel Xeon Phi co-processor.
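
The partitioning problem in the abstract above can be illustrated with a small sketch. This is not the algorithm proposed in the paper (whose details are not given here) but a plain dynamic program over the same kind of input: one discrete speed function per processor, with no assumption about its shape. The function name `partition`, the dictionary representation of a speed function, and the rule that a processor may receive zero work at zero cost are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def partition(n: int, speeds: List[Dict[int, float]]) -> Tuple[float, List[int]]:
    """Split n work units among processors, minimising the slowest finish time.

    speeds[i] maps a problem size x to the measured speed of processor i at
    that size (work units per second). Only the sizes present in the table
    may be assigned; a processor may also be left idle (0 units, 0 seconds).
    """
    # best[w] = (makespan, allocation) for giving exactly w units
    # to the processors handled so far.
    best: Dict[int, Tuple[float, List[int]]] = {0: (0.0, [])}

    for table in speeds:
        # Execution time of this processor at each admissible size.
        times = {0: 0.0}
        times.update({x: x / s for x, s in table.items() if x > 0 and s > 0})

        nxt: Dict[int, Tuple[float, List[int]]] = {}
        for w, (t_prev, alloc) in best.items():
            for x, t in times.items():
                total = w + x
                if total > n:
                    continue
                cand = max(t_prev, t)  # parallel time = slowest processor
                if total not in nxt or cand < nxt[total][0]:
                    nxt[total] = (cand, alloc + [x])
        best = nxt

    if n not in best:
        raise ValueError("n cannot be composed from the sizes in the speed tables")
    return best[n]


# Hypothetical, deliberately non-smooth speed functions for two processors.
speeds = [
    {1: 1.0, 2: 2.0, 3: 1.0, 4: 2.0},
    {1: 0.5, 2: 1.0, 3: 3.0},
]
makespan, allocation = partition(5, speeds)
print(makespan, allocation)
```

With these made-up tables, equal or proportional splitting would not find the best distribution, because the speeds rise and fall with problem size; the exhaustive dynamic program does not rely on any such shape.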


An unstructured electrostatic Particle-In-Cell (EUPIC) method is developed on arbitrary tetrahedral grids for simulation of plasmas bounded by arbitrary geometries. The electric potential in EUPIC is obtained on cell vertices from a finite volume Multi-Point Flux Approximation of Gauss' law using the indirect dual cell with Dirichlet, Neumann and external circuit boundary conditions. The resulting matrix equation for the nodal potential is solved with a restarted generalized minimal residual method (GMRES) and an ILU(0) preconditioner, parallelized using a combination of node coloring and level scheduling approaches. The electric field on vertices is obtained using the gradient theorem applied to the indirect dual cell. The algorithms for injection, particle loading, particle motion, and particle tracking are parallelized for unstructured tetrahedral grids. The algorithms for the potential solver, electric field evaluation, loading, and scatter-gather are verified using analytic solutions for test cases subject to the Laplace and Poisson equations. Grid sensitivity analysis examines the L2 and L∞ norms of the relative error in potential, field, and charge density as a function of edge-averaged and volume-averaged cell size. The analysis shows second-order convergence for the potential and first-order convergence for the electric field and charge density. Temporal sensitivity analysis is performed, and the momentum and energy conservation properties of the particle integrators in EUPIC are examined. The effects of cell size and timestep on the heating, slowing-down and deflection times are quantified. These times are found to depend almost linearly on the number of particles per cell. EUPIC simulations of current collection by cylindrical Langmuir probes in collisionless plasmas compare well with previous experimentally validated numerical results. These simulations were also used in a parallelization efficiency investigation. Results show that EUPIC achieves an efficiency of more than 80% when the simulation is performed on a single CPU of a non-uniform memory access node, and that the efficiency decreases as the number of threads increases further. EUPIC is applied to the simulation of the multi-species plasma flow over a geometrically complex CubeSat in Low Earth Orbit. The EUPIC potential and flowfield distribution around the CubeSat exhibit features that are consistent with previous simulations over simpler geometrical bodies.
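
The convergence orders quoted in the abstract above are the kind of figure a grid sensitivity study produces. As a minimal sketch of how an observed order is usually estimated from error norms on successively refined grids, assuming the error behaves like $C h^q$: the function name `observed_orders` and the sample numbers are invented for illustration and are not data from the paper.

```python
import math
from typing import List, Sequence


def observed_orders(h: Sequence[float], err: Sequence[float]) -> List[float]:
    """Estimate the convergence order between consecutive grid levels.

    h[i] is a characteristic cell size and err[i] the error norm on that grid.
    If err ~ C * h**q, then q = log(err[i] / err[i+1]) / log(h[i] / h[i+1]).
    """
    if len(h) != len(err):
        raise ValueError("h and err must have the same length")
    return [
        math.log(err[i] / err[i + 1]) / math.log(h[i] / h[i + 1])
        for i in range(len(h) - 1)
    ]


# Synthetic example: halving the cell size cuts the error by about 4x,
# which corresponds to second-order convergence.
cell_sizes = [0.4, 0.2, 0.1]
errors = [0.16, 0.04, 0.01]
print(observed_orders(cell_sizes, errors))
```

An estimate near 2 between each pair of levels would indicate second-order behaviour, and an estimate near 1 first-order behaviour.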


Articles published on Non-uniform Memory Access

Analysis of Memory System of Tiled Many-Core Processors

Scaling out NUMA-Aware Applications with RDMA-Based Distributed Shared Memory

Extending NUMA-BTLP Algorithm with Thread Mapping Based on a Communication Tree

Hybrid-Grained Dynamic Load Balanced GEMM on NUMA Architectures

How to implement any concurrent data structure

A Novel Data-Partitioning Algorithm for Performance Optimization of Data-Parallel Applications on Heterogeneous HPC Platforms

Assisting High-Level Synthesis Improve SpMV Benchmark Through Dynamic Dependence Analysis

Design of self‐adaptable data parallel applications on multicore clusters automatically optimized for performance and energy through load distribution

Performance and energy analysis of OpenMP runtime systems with dense linear algebra algorithms

Understanding the performance of storage class memory file systems in the NUMA architecture

Optimization of Remote Core Locking Synchronization in Multithreaded Programs for Multicore Computer Systems

Hardware Transactional Memory Exploration in Coherence-Free Many-Core Architectures

A parallel electrostatic Particle-in-Cell method on unstructured tetrahedral grids for large-scale bounded collisionless plasma simulations

Graph partitioning applied to DAG scheduling to reduce NUMA effects

Algorithms for Optimization of Processor and Memory Affinity for Remote Core Locking Synchronization in Multithreaded Applications

NUMA-Aware Thread Scheduling for Big Data Transfers over Terabits Network Infrastructure

Multi-GPU configuration of 4D intensity modulated radiation therapy inverse planning using global optimization

Parallel Data Partitioning Algorithms for Optimization of Data-Parallel Applications on Modern Extreme-Scale Multicore Platforms for Performance and Energy

Performance Optimization of Multithreaded 2D Fast Fourier Transform on Multicore Processors Using Load Imbalancing Parallel Computing Method

Evaluating architecture impact on system energy efficiency.
