Parallel Overhead Research Articles

Multicore processor architectures have gained popularity in recent years. However, many available applications cannot take full advantage of these architectures. Researchers have therefore developed characterization techniques to help programmers understand the behavior of these applications on multicore platforms and tune them for better efficiency. This paper proposes an on-the-fly, configuration-independent approach for characterizing the inherent characteristics of multicore applications. The approach is fast because it does not depend on the details of any specific machine configuration and does not require repeating the characterization for every target configuration. It tracks only memory accesses and the cores that perform them, by piping memory traces on the fly to the analysis tool. We applied this approach to eight applications drawn from the SPLASH-2 and PARSEC benchmark suites. This paper presents the inherent characteristics of these applications, including memory access instructions, communication patterns, sharing degree, invalidation degree, communication slack, and communication locality. The results show that two of the studied applications, Cholesky and Fluidanimate, have high parallelization overhead. The results also indicate that the studied SPLASH-2 applications have higher communication rates than the studied PARSEC applications, and that these rates generally increase with the number of threads used. Most sharing and invalidation occurs at small degrees. However, two of the SPLASH-2 applications have a significant fraction of communication with high sharing degrees involving four or more threads. Most of the applications have some uniform communication component, and the initial thread is generally involved in more communication than the other threads.
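As a rough illustration of how a sharing-degree histogram could be derived from a stream of memory accesses, the sketch below processes a trace one record at a time. It is not the paper's tool: the trace format `(thread_id, address, op)`, the function name `sharing_degrees`, and the working definition of sharing degree (the number of distinct other threads that read a value before it is overwritten) are all assumptions made for this example.

```python
from collections import Counter


def sharing_degrees(trace):
    """Histogram of sharing degrees for a memory trace.

    trace: iterable of (thread_id, address, op), op being 'R' or 'W'
    (a hypothetical format chosen for this sketch).
    Returns {degree: number of written values consumed by that many
    distinct threads other than the writer}.
    """
    last_writer = {}  # address -> thread that produced the current value
    readers = {}      # address -> other threads that have read that value
    histogram = Counter()

    for thread_id, address, op in trace:
        if op == "W":
            # A new value replaces the old one: record how widely the
            # old value was shared before discarding its reader set.
            if address in last_writer and readers[address]:
                histogram[len(readers[address])] += 1
            last_writer[address] = thread_id
            readers[address] = set()
        elif address in last_writer and thread_id != last_writer[address]:
            # A read of a value produced by another thread is communication.
            readers[address].add(thread_id)

    # Values still live at the end of the trace are counted as well.
    for consumers in readers.values():
        if consumers:
            histogram[len(consumers)] += 1
    return dict(histogram)
```

Because the function depends only on which thread touched which address, it needs no cache or interconnect parameters, which is the sense in which such a measurement is configuration-independent.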

The computational intensity and sequential nature of estimation techniques for Bayesian methods in statistics and machine learning, combined with their increasing application to big data analytics, necessitate both the identification of opportunities to parallelize techniques such as Markov Chain Monte Carlo (MCMC) sampling, and the development of general strategies for mapping such parallel algorithms to modern CPUs in order to push performance up to the compute-based and/or memory-based hardware limits. Two opportunities for Single-Instruction Multiple-Data (SIMD) parallelization of MCMC sampling for probabilistic graphical models are presented. In exchangeable models with many observations, such as Bayesian Generalized Linear Models (GLMs), child-node contributions to the conditional posterior of each node can be calculated concurrently. In undirected graphs with discrete-value nodes, concurrent sampling of conditionally independent nodes can be transformed into a SIMD form. High-performance libraries with multi-threading and vectorization capabilities can be readily applied to such SIMD opportunities to gain decent speedup, while a series of high-level source-code and runtime modifications provide a further performance boost by reducing parallelization overhead and increasing data locality for Non-Uniform Memory Access architectures. For big-data Bayesian GLM graphs, the end result is a routine for evaluating the conditional posterior and its gradient vector that is 5 times faster than a naive implementation using (built-in) multi-threaded Intel MKL BLAS, and comes within striking distance of the memory-bandwidth-induced hardware limit. Using multi-threading for cache-friendly, fine-grained parallelization can outperform coarse-grained alternatives, which are often less cache-friendly, a likely scenario in modern predictive analytics workflows such as Hierarchical Bayesian GLM, variable selection, and ensemble regression and classification. The proposed optimization strategies improve the scaling of performance with the number of cores and the width of vector units (applicable to many-core SIMD processors such as Intel Xeon Phi and Graphics Processing Units), resulting in cost-effectiveness, energy efficiency (‘green computing’), and higher speed on multi-core x86 processors.
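The first opportunity described above, concurrent evaluation of per-observation contributions, can be sketched as a map over blocks of observations followed by a reduction. The example below is only a structural illustration and not the authors' routine: the logistic link, the zero-mean Gaussian prior with standard deviation `prior_sd`, and the names `chunk_contribution` and `log_posterior` are assumptions. Pure-Python threads also do not deliver the speedups reported in the abstract; an optimized version would rely on vectorized, multi-threaded numerical libraries.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def chunk_contribution(beta, rows):
    """Partial log-likelihood and gradient for one block of observations."""
    loglik = 0.0
    grad = [0.0] * len(beta)
    for x, y in rows:
        eta = sum(b * xi for b, xi in zip(beta, x))
        # Numerically stable log(1 + exp(eta)).
        log1pexp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        loglik += y * eta - log1pexp
        resid = y - math.exp(eta - log1pexp)  # y - sigmoid(eta)
        for j, xi in enumerate(x):
            grad[j] += resid * xi
    return loglik, grad


def log_posterior(beta, data, prior_sd=10.0, n_chunks=4):
    """Log-posterior (up to a constant) and gradient for Bayesian logistic
    regression, assuming an independent N(0, prior_sd^2) prior on each
    coefficient. data is a list of (feature_list, label) pairs."""
    # Observations are exchangeable, so any partition gives the same sum.
    chunks = [data[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        parts = list(pool.map(lambda rows: chunk_contribution(beta, rows), chunks))
    # Reduce the independent partial results.
    loglik = sum(part[0] for part in parts)
    grad = [sum(part[1][j] for part in parts) for j in range(len(beta))]
    # Add the prior's contribution to both the value and the gradient.
    logp = loglik - 0.5 * sum(b * b for b in beta) / prior_sd ** 2
    grad = [g - b / prior_sd ** 2 for g, b in zip(grad, beta)]
    return logp, grad
```

Because the reduction is a plain sum, the result should not depend on how the observations are partitioned (up to floating-point rounding), which is what makes the block size a free tuning parameter for cache locality.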

Articles published on Parallel Overhead

Flicker Propagation in Power Networks with Hybrid and Parallel Overhead Transmission Lines

Characterization of Shared-Memory Multi-Core Applications

Tuning the victim selection policy of Intel TBB

SIMD parallel MCMC sampling with applications for big-data Bayesian analytics

Synchronous parallel Kinetic Monte Carlo: Implementation and results for object and lattice approaches

A Two-Level Multithreaded Delaunay Kernel

Implementation of a Parallel XTS Encryption Mode of Operation

Implementation of a parallel XTS encryption mode of operation

Exploiting Thread-Level Parallelism Based on Balancing Load for Speculative Multithreading

Power flow for general mixed distribution networks

Optimal Task Mapping for NEMO Model

Limits of Parallelism and Boosting in Dim Silicon

Effects of pantograph arcing on railway systems with auto transformers

Finite-element-wise domain decomposition iterative solvers with polynomial preconditioning

Modal Domain-Based Modeling of Parallel Transmission Lines With Emphasis on Accurate Representation of Mutual Coupling Effects

OpenMP-based parallel transient stability simulation for large-scale power systems

TBBench: A Micro-Benchmark Suite for Intel Threading Building Blocks

Generalizing Amdahl's Law for Power and Energy

Fast $\ell_1$-SPIRiT Compressed Sensing Parallel Imaging MRI: Scalable Parallel Implementation and Clinically Feasible Runtime

Parallel anisotropic mesh adaptivity with dynamic load balancing for cardiac electrophysiology
