Non-Uniform Memory Access Nodes Research Articles

Modern HPC platforms are highly heterogeneous, tightly integrating multicore CPUs with accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays) to address the twin critical concerns of performance and energy efficiency. Because of this tight integration, processing elements contend for shared on-chip resources such as the Last Level Cache (LLC) and the interconnect, and for shared nodal resources such as DRAM and PCI-E links. The resulting complexities include resource contention, non-uniform memory access (NUMA), and accelerator-specific limitations such as limited main memory, which necessitates support for efficient out-of-card execution. Due to these complexities, the performance profiles of data-parallel applications executing on these platforms are not smooth and deviate significantly from the shapes that allowed state-of-the-art load-balancing algorithms to find optimal solutions. In this paper, we propose a hierarchical two-level data partitioning algorithm that minimizes the parallel execution time of data-parallel applications on clusters of $h$ identical nodes, where each node has $c$ heterogeneous processors. The algorithm takes as input $c$ discrete speed functions of cardinality $m$, one for each of the $c$ heterogeneous processors, and makes no assumptions about the shapes of these functions. Unlike load-balancing algorithms, the optimal solutions it finds may not balance the application's load in terms of execution time. The proposed algorithm has a low time complexity of $O(m^{2} \times h + m^{3} \times c^{3})$, whereas the state-of-the-art algorithm solving the same problem has a complexity of $O(m^{3} \times c^{3} \times h^{3})$. We also propose an extension of the algorithm for clusters of $h$ non-identical nodes, where each node has $c$ heterogeneous processors.
We experimentally demonstrate the optimality of our algorithm using two well-known and highly optimized multi-threaded data-parallel applications, matrix-matrix multiplication and 2D fast Fourier transform, on a heterogeneous multi-accelerator NUMA node containing an Intel multicore Haswell CPU, an Nvidia K40c GPU, and an Intel Xeon Phi co-processor, and on a simulated homogeneous cluster of such nodes.
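
To illustrate why an optimal distribution need not balance execution times when the speed functions are not smooth, here is a minimal single-node sketch. It is not the algorithm proposed in the paper: it is a plain dynamic program over discrete time tables, and the function name `partition_min_makespan`, the table `times`, and the sample values are all hypothetical. Each table entry `times[i][x]` is assumed to be the time processor `i` needs for `x` work units (equivalently `x` divided by the speed at `x`).

```python
from typing import List, Sequence, Tuple


def partition_min_makespan(
    times: Sequence[Sequence[float]], n: int
) -> Tuple[float, List[int]]:
    """Split n work units across processors to minimise the slowest finish time.

    times[i][x] is the (assumed, hypothetical) time processor i takes for
    x units, for x = 0..m, with times[i][0] == 0. No shape is assumed.
    """
    inf = float("inf")
    # best[w]: smallest makespan for w units on the processors handled so far.
    best = [0.0] + [inf] * n
    choices: List[List[int]] = []
    for t in times:
        new_best = [inf] * (n + 1)
        pick = [0] * (n + 1)
        for w in range(n + 1):
            # Try giving x units to this processor, the rest to earlier ones.
            for x in range(min(w, len(t) - 1) + 1):
                cand = max(best[w - x], t[x])
                if cand < new_best[w]:
                    new_best[w] = cand
                    pick[w] = x
        best = new_best
        choices.append(pick)
    if best[n] == inf:
        raise ValueError("workload exceeds the range of the time tables")
    # Walk the recorded choices backwards to recover the distribution.
    dist: List[int] = []
    w = n
    for pick in reversed(choices):
        dist.append(pick[w])
        w -= pick[w]
    dist.reverse()
    return best[n], dist


# Hypothetical time tables: processor 0 has a performance drop at x = 3.
times = [
    [0.0, 1.0, 2.0, 9.0, 4.0],
    [0.0, 3.0, 6.0, 9.0, 12.0],
]
makespan, distribution = partition_min_makespan(times, 5)
print(makespan, distribution)
```

With these sample tables the best split gives four units to the first processor and one to the second, so the two processors finish at different times (4 and 3 time units) even though the overall execution time is minimal. The sketch runs in time proportional to the number of processors times the workload size times the table length, and it does not attempt to reproduce the complexity bounds quoted in the abstract.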

An unstructured electrostatic Particle-In-Cell (EUPIC) method is developed on arbitrary tetrahedral grids for simulation of plasmas bounded by arbitrary geometries. The electric potential in EUPIC is obtained on cell vertices from a finite volume Multi-Point Flux Approximation of Gauss' law using the indirect dual cell with Dirichlet, Neumann, and external circuit boundary conditions. The resulting matrix equation for the nodal potential is solved with a restarted generalized minimal residual method (GMRES) and an ILU(0) preconditioner algorithm, parallelized using a combination of node coloring and level scheduling approaches. The electric field on vertices is obtained using the gradient theorem applied to the indirect dual cell. The algorithms for injection, particle loading, particle motion, and particle tracking are parallelized for unstructured tetrahedral grids. The algorithms for the potential solver, electric field evaluation, loading, and scatter-gather are verified using analytic solutions for test cases subject to the Laplace and Poisson equations. Grid sensitivity analysis examines the L2 and L∞ norms of the relative error in potential, field, and charge density as a function of edge-averaged and volume-averaged cell size. The analysis shows second-order convergence for the potential and first-order convergence for the electric field and charge density. Temporal sensitivity analysis is performed, and the momentum and energy conservation properties of the particle integrators in EUPIC are examined. The effects of cell size and timestep on the heating, slowing-down, and deflection times are quantified. These times are found to be almost linearly dependent on the number of particles per cell. EUPIC simulations of current collection by cylindrical Langmuir probes in collisionless plasmas compare well with previous experimentally validated numerical results.
These simulations were also used in a parallelization efficiency investigation. Results show that EUPIC achieves a parallel efficiency of more than 80% when the simulation runs on a single CPU of a non-uniform memory access node, and that the efficiency decreases as the number of threads increases further. EUPIC is applied to the simulation of the multi-species plasma flow over a geometrically complex CubeSat in Low Earth Orbit. The EUPIC potential and flowfield distribution around the CubeSat exhibit features that are consistent with previous simulations over simpler geometrical bodies.
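
The convergence orders quoted in this abstract are the kind of figure usually estimated from the error norms on two grids of different cell size. As a hedged sketch of that standard estimate (the abstract does not say how the authors computed theirs), the helper below uses the ratio of errors and the ratio of cell sizes. The function name `observed_order` and the sample numbers are hypothetical and are not data from the paper.

```python
import math


def observed_order(h_coarse: float, err_coarse: float,
                   h_fine: float, err_fine: float) -> float:
    """Estimate the convergence order p, assuming err is proportional to h**p."""
    return math.log(err_coarse / err_fine) / math.log(h_coarse / h_fine)


# Hypothetical error norms: halving the cell size quarters the error
# (order close to 2) or halves it (order close to 1).
order_potential = observed_order(0.2, 4e-3, 0.1, 1e-3)
order_field = observed_order(0.2, 2e-2, 0.1, 1e-2)
print(round(order_potential, 3), round(order_field, 3))
```

The estimate is only meaningful when both grids are fine enough for the leading error term to dominate, which is why sensitivity studies of this kind typically use more than two grid levels.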

Articles published on Non-Uniform Memory Access Nodes

NUMA-Aware DGEMM Based on 64-Bit ARMv8 Multicore Processors Architecture

A stealing mechanism for delegation methods

A Hierarchical Data-Partitioning Algorithm for Performance Optimization of Data-Parallel Applications on Heterogeneous Multi-Accelerator NUMA Nodes

A parallel electrostatic Particle-in-Cell method on unstructured tetrahedral grids for large-scale bounded collisionless plasma simulations

NUMA-Aware Thread Scheduling for Big Data Transfers over Terabits Network Infrastructure

How to implement any concurrent data structure for modern servers

Black-box Concurrent Data Structures for NUMA Architectures

Supporting NUMA-Aware I/O in Virtual Machines

Modeling memory access behavior for data mapping

NUMA-aware scheduling and memory allocation for data-flow task-parallel applications

Characterizing communication and page usage of parallel applications for thread and data mapping

Design, Implementation, and Evaluation of a NUMA-Aware Cache for iSCSI Storage Servers

Scalable black-box prediction models for multi-dimensional adaptation on NUMA multi-cores

NUMA-aware reader-writer locks

High‐performance execution of service compositions: a multicore‐aware engine design
