Non-uniform Memory Access Research Articles

Dynamic task parallelism is a popular programming model on shared-memory systems. Compared to data parallel loop-based concurrency, it promises enhanced scalability, load balancing and locality. These promises, however, are undermined by non-uniform memory access (NUMA) systems. We show that it is possible to preserve the uniform hardware abstraction of contemporary task-parallel programming models, for both computing and memory resources, while achieving near-optimal data locality. Our run-time algorithms for NUMA-aware task and data placement are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences and reuse. This information is readily available in the run-time systems of modern task-parallel programming frameworks, and from the operating system regarding the placement of previously allocated memory. Our algorithms take advantage of data-flow style task parallelism, where the privatization of task data enhances scalability through the elimination of false dependences and enables fine-grained dynamic control over the placement of application data. We demonstrate that the benefits of dynamically managing data placement outweigh the privatization cost, even when comparing with target-specific optimizations through static, NUMA-aware data interleaving. Our implementation and the experimental evaluation on a set of high-performance benchmarks executing on a 192-core system with 24 NUMA nodes show that the fraction of local memory accesses can be increased to more than 99%, resulting in a speedup of up to 5× compared to a NUMA-aware hierarchical work-stealing baseline.
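The idea of using information about where a task's input data already resides to decide where the task should run can be illustrated with a small sketch. This is not the authors' algorithm: the function names `pick_node` and `local_fraction`, the `(numa_node, size_bytes)` input format, and the "most bytes wins" rule are all assumptions chosen for illustration.

```python
from collections import defaultdict


def pick_node(inputs, default_node=0):
    """Pick the NUMA node that already holds most of a task's input bytes.

    inputs: list of (numa_node, size_bytes) pairs, one per input buffer
    (a hypothetical format; a real run-time would obtain this from its
    dependence tracking and from the operating system).
    """
    bytes_per_node = defaultdict(int)
    for node, size in inputs:
        bytes_per_node[node] += size
    if not bytes_per_node:
        return default_node
    # Most bytes wins; iterating in sorted order sends ties to the lowest node id.
    return max(sorted(bytes_per_node), key=lambda n: bytes_per_node[n])


def local_fraction(inputs, node):
    """Fraction of the task's input bytes that are local to `node`."""
    total = sum(size for _, size in inputs)
    if total == 0:
        return 1.0
    local = sum(size for n, size in inputs if n == node)
    return local / total


if __name__ == "__main__":
    task_inputs = [(0, 4096), (3, 8192), (3, 4096), (1, 12288)]
    node = pick_node(task_inputs)
    print(node, local_fraction(task_inputs, node))
```

A placement rule of this kind only looks at one task in isolation; the abstract describes a system that additionally controls where output data is placed and adapts to dynamic changes, which this sketch does not attempt.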

The computational intensity and sequential nature of estimation techniques for Bayesian methods in statistics and machine learning, combined with their increasing application to big-data analytics, necessitate both the identification of opportunities to parallelize techniques such as Markov Chain Monte Carlo (MCMC) sampling and the development of general strategies for mapping such parallel algorithms to modern CPUs in order to push performance up to the compute-based and/or memory-based hardware limits. Two opportunities for Single-Instruction Multiple-Data (SIMD) parallelization of MCMC sampling for probabilistic graphical models are presented. In exchangeable models with many observations, such as Bayesian Generalized Linear Models (GLMs), child-node contributions to the conditional posterior of each node can be calculated concurrently. In undirected graphs with discrete-value nodes, concurrent sampling of conditionally independent nodes can be transformed into a SIMD form. High-performance libraries with multi-threading and vectorization capabilities can be readily applied to such SIMD opportunities to obtain a decent speedup, while a series of high-level source-code and runtime modifications provides a further performance boost by reducing parallelization overhead and increasing data locality on Non-Uniform Memory Access architectures. For big-data Bayesian GLM graphs, the end result is a routine for evaluating the conditional posterior and its gradient vector that is 5 times faster than a naive implementation using (built-in) multi-threaded Intel MKL BLAS, and that comes within striking distance of the memory-bandwidth-induced hardware limit. Using multi-threading for cache-friendly, fine-grained parallelization can outperform coarse-grained alternatives, which are often less cache-friendly, a likely scenario in modern predictive-analytics workflows such as Hierarchical Bayesian GLM, variable selection, and ensemble regression and classification. The proposed optimization strategies improve the scaling of performance with the number of cores and the width of vector units (applicable to many-core SIMD processors such as Intel Xeon Phi and Graphics Processing Units), resulting in cost-effectiveness, energy efficiency (‘green computing’), and higher speed on multi-core x86 processors.
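The observation that per-observation contributions can be calculated concurrently can be sketched for a logistic-regression likelihood, one common GLM. The example below is an illustration under stated assumptions, not the paper's routine: it is pure Python rather than vectorized BLAS code, it omits the prior term of the posterior, and the names `chunk_contribution` and `evaluate` are hypothetical. It shows only that the log-likelihood and its gradient are sums of independent chunk results, so chunks could be handed to separate threads or vector lanes.

```python
import math


def chunk_contribution(X, y, beta):
    """Log-likelihood and gradient contribution of one chunk of observations."""
    loglik = 0.0
    grad = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        # Numerically stable log(1 + exp(eta)).
        log1pexp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        loglik += yi * eta - log1pexp
        # Numerically stable logistic function.
        if eta >= 0.0:
            p = 1.0 / (1.0 + math.exp(-eta))
        else:
            e = math.exp(eta)
            p = e / (1.0 + e)
        for j, v in enumerate(xi):
            grad[j] += (yi - p) * v
    return loglik, grad


def evaluate(X, y, beta, n_chunks=4):
    """Split the observations into chunks and sum the independent results."""
    n = len(X)
    size = max(1, -(-n // max(1, n_chunks)))  # ceiling division
    parts = [
        chunk_contribution(X[i:i + size], y[i:i + size], beta)
        for i in range(0, n, size)
    ]
    loglik = sum(p[0] for p in parts)
    grad = [sum(p[1][j] for p in parts) for j in range(len(beta))]
    return loglik, grad


if __name__ == "__main__":
    X = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, 0.0], [1.0, 1.5]]
    y = [1, 0, 1, 0, 1]
    print(evaluate(X, y, [0.3, -0.7], n_chunks=2))
```

Because each chunk touches only its own rows of `X` and `y`, keeping a chunk's rows in memory close to the core that processes it is the kind of data-locality concern the abstract raises for NUMA machines.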

Articles published on Non-uniform Memory Access

Modeling memory access behavior for data mapping

Scalable adaptive NUMA-aware lock

NUMA-aware scheduling and memory allocation for data-flow task-parallel applications

Distributed Halide

High performance 2D simulations for the problem of optical breakdown

Exploiting task and data parallelism in ILUPACK’s preconditioned CG solver on NUMA architectures and many-core accelerators

The Effect of NUMA Tunings on CPU Performance

Challenges of Memory Management on Modern NUMA System

Palirria: accurate on‐line parallelism estimation for adaptive work‐stealing

SymS: a symmetrical scheduler to improve multi‐threaded program performance on NUMA systems

Scaling Runtimes for Irregular Algorithms to Large-Scale NUMA Systems

Scaling up concurrent main-memory column-store scans

Achieving High Performance With TCP Over 40 GbE on NUMA Architectures for CMS Data Acquisition

NUMA-Aware Scalable and Efficient In-Memory Aggregation on Large Domains

Characterizing communication and page usage of parallel applications for thread and data mapping

NumaGiC

Dense Matrix Computations on NUMA Architectures with Distance-Aware Work Stealing

SIMD parallel MCMC sampling with applications for big-data Bayesian analytics

Design, Implementation, and Evaluation of a NUMA-Aware Cache for iSCSI Storage Servers
