Discrete Speed Research Articles

Modern HPC platforms are highly heterogeneous, tightly integrating multicore CPUs with accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays) to address the twin critical concerns of performance and energy efficiency. Because of this tight integration, processing elements contend for shared on-chip resources such as the Last Level Cache (LLC) and the interconnect, and for shared nodal resources such as DRAM and PCI-E links. The resulting complexities include resource contention, non-uniform memory access (NUMA), and accelerator-specific limitations such as limited main memory, which necessitates support for efficient out-of-card execution. Due to these complexities, the performance profiles of data-parallel applications executing on these platforms are not smooth and deviate significantly from the shapes that allowed state-of-the-art load-balancing algorithms to find optimal solutions. In this paper, we propose a hierarchical two-level data partitioning algorithm that minimizes the parallel execution time of data-parallel applications on clusters of $h$ identical nodes, where each node has $c$ heterogeneous processors. The algorithm takes as input $c$ discrete speed functions of cardinality $m$, one for each of the $c$ heterogeneous processors, and makes no assumptions about the shapes of these functions. Unlike load-balancing algorithms, the optimal solutions it finds may not balance the application's load in terms of execution time. The proposed algorithm has a low time complexity of $O(m^{2} \times h + m^{3} \times c^{3})$, whereas the state-of-the-art algorithm solving the same problem has a complexity of $O(m^{3} \times c^{3} \times h^{3})$. We also propose an extension of the algorithm for clusters of $h$ non-identical nodes, where each node has $c$ heterogeneous processors. We experimentally demonstrate the optimality of our algorithm using two well-known and highly optimized multi-threaded data-parallel applications, matrix-matrix multiplication and 2D fast Fourier transform, on a heterogeneous multi-accelerator NUMA node containing an Intel multicore Haswell CPU, an Nvidia K40c GPU, and an Intel Xeon Phi co-processor, and on a simulated homogeneous cluster of such nodes.
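To give a rough sense of the gap between the two complexity bounds quoted above, the following worked comparison plugs in illustrative values $m = 100$, $c = 3$, $h = 10$. These values are assumptions chosen only for the arithmetic and do not come from the paper. Asymptotic bounds hide constant factors, so the ratio indicates scale rather than measured speedup.

$$
\begin{aligned}
m^{2} \times h + m^{3} \times c^{3} &= 10^{4} \times 10 + 10^{6} \times 27 = 10^{5} + 2.7 \times 10^{7} \approx 2.71 \times 10^{7}, \\
m^{3} \times c^{3} \times h^{3} &= 10^{6} \times 27 \times 10^{3} = 2.7 \times 10^{10}, \\
\frac{m^{3} \times c^{3} \times h^{3}}{m^{2} \times h + m^{3} \times c^{3}} &\approx \frac{2.7 \times 10^{10}}{2.71 \times 10^{7}} \approx 10^{3}.
\end{aligned}
$$

Under these assumed values the hierarchical bound is about three orders of magnitude smaller, and the gap widens as the number of nodes $h$ grows, because $h$ enters the first bound only linearly and only in the $m^{2}$ term.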

Modern HPC platforms have become highly heterogeneous owing to the tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays), which lets them pursue the dominant objectives of performance and energy efficiency. Because of this tight integration, processing elements contend for shared on-chip resources such as the Last Level Cache (LLC) and the interconnect, and for shared nodal resources such as DRAM and PCI-E links. The resulting severe resource contention and Non-Uniform Memory Access (NUMA) pose serious challenges to model and algorithm developers. Moreover, the accelerators have limited main memory compared to the multicore CPU host and are connected to it via limited-bandwidth PCI-E links, so they require support for efficient out-of-card execution. Together, these complexities (resource contention, NUMA, and accelerator-specific limitations) make it harder to optimize data-parallel applications on these platforms for performance. In particular, the performance profiles of data-parallel applications executing on these platforms are not smooth and deviate significantly from the shapes that allowed state-of-the-art load-balancing algorithms to find optimal solutions. In this paper, we formulate the problem of optimizing data-parallel applications on modern heterogeneous HPC platforms for performance. We then propose a new model-based data partitioning algorithm, which minimizes the execution time of computations in the parallel execution of the application. The algorithm takes as input a set of $p$ discrete speed functions corresponding to the $p$ available heterogeneous processors and makes no assumptions about the shapes of these functions. We prove the correctness of the algorithm and its complexity of $O(m^3 \times p^3)$, where $m$ is the cardinality of the input discrete speed functions. We experimentally demonstrate the optimality and efficiency of our algorithm using two data-parallel applications, matrix multiplication and fast Fourier transform, on a heterogeneous cluster of nodes where each node contains an Intel multicore Haswell CPU, an Nvidia K40c GPU, and an Intel Xeon Phi co-processor.
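The problem stated in the abstract above (split a workload across $p$ processors described by discrete speed functions so that the slowest processor finishes as early as possible) can be illustrated with a small sketch. This is not the algorithm proposed in the paper: it is a plain dynamic program over workload sizes, and the function name `partition_min_makespan`, the input layout, and the sample speeds are all hypothetical. It assumes the workload is a whole number of units, that `speeds[i][x-1]` is the speed of processor `i` when it is given `x` units, and that the execution time is `x / speed`.

```python
from math import inf


def partition_min_makespan(speeds, n):
    """Split n work units among processors to minimize the maximum execution time.

    speeds[i][x - 1] is the (assumed) speed of processor i at problem size x,
    for x = 1..m. A processor may also receive 0 units at zero cost.
    Returns (makespan, distribution). Illustrative sketch only.
    """
    # times[i][x]: time for processor i to execute x units (index 0 means idle).
    times = [[0.0] + [x / s for x, s in enumerate(row, start=1)] for row in speeds]

    # best[w]: minimal makespan for w units over the processors seen so far.
    best = [0.0] + [inf] * n
    choices = []  # choices[i][w]: units given to processor i in that optimum
    for t in times:
        new_best = [inf] * (n + 1)
        pick = [0] * (n + 1)
        for w in range(n + 1):
            # Try every size x this processor's speed function covers.
            for x in range(min(w, len(t) - 1) + 1):
                cand = max(best[w - x], t[x])
                if cand < new_best[w]:
                    new_best[w] = cand
                    pick[w] = x
        best = new_best
        choices.append(pick)

    if best[n] == inf:
        raise ValueError("workload exceeds the range covered by the speed functions")

    # Walk the recorded choices backwards to recover the distribution.
    distribution = []
    w = n
    for pick in reversed(choices):
        distribution.append(pick[w])
        w -= pick[w]
    distribution.reverse()
    return best[n], distribution


# Hypothetical speed functions: processor 0 has a sharp dip at size 3.
makespan, distribution = partition_min_makespan([[10, 10, 2, 10], [5, 5, 5, 5]], 5)
print(makespan, distribution)
```

With these made-up speeds the sketch assigns 4 units to the first processor and 1 to the second, giving execution times of 0.4 and 0.2. The optimum is therefore not load-balanced, which is the behaviour the abstracts attribute to non-smooth speed functions. The sketch runs in $O(p \times n \times m)$ time and makes no claim about matching the complexity bounds quoted in the abstracts.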

Articles published on Discrete Speed

A Hierarchical Data-Partitioning Algorithm for Performance Optimization of Data-Parallel Applications on Heterogeneous Multi-Accelerator NUMA Nodes

Geodynamics of very high speed transport systems

Turn‐to‐turn short circuit of motor stator fault diagnosis in continuous state based on deep auto‐encoder

Discretely variable speed ratio control strategy for continuously variable transmission system considering hydraulic energy loss

Discrete Train Speed Profile Optimization for Urban Rail Transit: A Data-Driven Model and Integrated Algorithms Based on Machine Learning

Predictive Maneuver Planning for an Autonomous Vehicle in Public Highway Traffic

Optimization of tow-steered composite wind turbine blades for static aeroelastic performance

Elucidating how correlated operation of shear transformation zones leads to shear localization and fracture in metallic glasses: Tensile tests on Cu-Zr based metallic-glass microwires, molecular dynamics simulations, and modelling

Kinematic Correlates of Kinetic Outcomes Associated With Running-Related Injury

The bullet problem with discrete speeds

Travel Time Functions Prediction for Time-Dependent Networks

A Novel Data-Partitioning Algorithm for Performance Optimization of Data-Parallel Applications on Heterogeneous HPC Platforms

Optimal task execution speed setting and lower bound for delay and energy minimization

Improvement of jaw crushers reliability using elastic pneumatic elements in the connection of kinematic pairs

Hybrid Computational Mechanical Sensorless Fuzzified Technique for Speed Estimation of Permanent Magnet Direct Current Brushed Motor

Application of finitized power series distributions to accelerated variate generation. Part II: the case of the logarithmic distribution

Free Convection Heat Transfer of Nanofluids into Cubical Enclosures with a Bottom Heat Source: Lattice Boltzmann Application

Discrete Adaptive Speed Sensorless Drive of Induction Motors

Flow shop for dual CPUs in dynamic voltage scaling

Energy-efficient real-time scheduling for two-type heterogeneous multiprocessors
