Data-parallel Applications Research Articles

Recently, the abstraction of coflow was introduced to capture the collective data transmission patterns among modern distributed data-parallel applications. During processing, coflows generally act as barriers; accordingly, time-sensitive applications prefer their coflows to complete within deadlines, and deadline-aware coflow scheduling becomes crucial. Many of these data-parallel applications, including large-scale query systems, distributed iterative training, and erasure-code-enabled storage, can tolerate loss-bounded incomplete inputs by design. This tolerance opens a flexible design space for scheduling their coflows: when overloaded, the network can trade coflow completeness for timeliness and balance the completeness of different coflows on demand. Unfortunately, existing coflow schedulers neglect this tolerance, resulting in inflexible and inefficient bandwidth allocations. In this paper, we explore this fundamental trade-off and design POCO, a POlicy-based COflow scheduler, along with a transport layer enhancement scheme, to achieve customizable selective coflow completion for emerging time-sensitive distributed applications.
Internally, POCO employs a suite of novel designs along with admission controls to make flexible, work-conserving, and performance-guaranteed rate allocations to online coflow requests efficiently. Extensive trace-based simulations indicate that POCO is highly flexible and achieves optimal coflow schedules that respect the requirements specified by applications.
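
As a rough illustration of the completeness-for-timeliness trade-off described above, the following sketch computes the minimum rate a coflow needs to meet its deadline once its loss tolerance is taken into account, and admits coflows greedily while a link has capacity left. This is not POCO's algorithm: the single-bottleneck-link model, the `Coflow` fields, and the greedy smallest-rate-first policy are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Coflow:
    # All fields are hypothetical; the abstract does not define a data model.
    name: str
    volume: float     # total data to send over one bottleneck link
    deadline: float   # time available before the deadline
    tolerance: float  # fraction of input the application can lose, in [0, 1)


def min_rate(c: Coflow) -> float:
    """Rate needed to deliver the non-droppable part of the coflow on time."""
    return (1.0 - c.tolerance) * c.volume / c.deadline


def admit(coflows, capacity):
    """Greedy admission control: accept coflows in order of increasing
    minimum rate while the link capacity is not exceeded."""
    admitted, rejected, used = [], [], 0.0
    for c in sorted(coflows, key=min_rate):
        rate = min_rate(c)
        if used + rate <= capacity:
            admitted.append(c.name)
            used += rate
        else:
            rejected.append(c.name)
    return admitted, rejected, used


requests = [
    Coflow("a", volume=100, deadline=10, tolerance=0.0),
    Coflow("b", volume=100, deadline=10, tolerance=0.5),
    Coflow("c", volume=200, deadline=10, tolerance=0.0),
]
admitted, rejected, used = admit(requests, capacity=16)
print(admitted, rejected, used)
```

In this toy setting, coflow "b" needs only half the rate of "a" because it tolerates losing half of its input, which is the kind of flexibility the abstract says existing schedulers leave unused.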

Accelerating the bi-objective optimization of applications for performance and energy is crucial to achieving energy efficiency objectives and meeting quality-of-service requirements in modern high-performance computing platforms and cloud computing infrastructures. In this work, we highlight the crucial challenges in accelerating model-based methods proposed for the bi-objective optimization of data-parallel applications for performance and energy that employ workload distribution between the executing processors as the decision variable. The methods solve unconstrained bi-objective optimization problems. They take as input the processors' performance and energy profiles in the form of discrete functions of workload size, and output Pareto-optimal solutions (workload distributions) that minimize the execution time and the total energy consumption of computations during the parallel execution of the application. One of the challenges is the fast computation of Pareto-optimal solutions. We then formulate the bi-objective optimization problem of data-parallel applications for performance and energy through workload distribution on a cluster of p identical hybrid nodes, each containing h heterogeneous processors. The state-of-the-art algorithm for solving the problem is sequential and takes exorbitant execution times to find Pareto-optimal solutions for even moderate numbers of processors. We propose two algorithms that address this shortcoming. The first algorithm is an exact sequential algorithm that is more efficient and amenable to parallelization and achieves a complexity reduction of O(m × h) over the state-of-the-art sequential algorithm, where m is the cardinality of the input discrete execution time and dynamic energy functions.
The second algorithm is a parallel algorithm executed by q identical parallel processes that reduces the complexity of our proposed sequential algorithm by O(q) and therefore achieves a complexity reduction of O(m × h × q) over the state-of-the-art sequential algorithm. Finally, we experimentally analyze the practical efficacy of our proposed algorithms for two data-parallel applications, matrix multiplication and fast Fourier transform, on a heterogeneous hybrid node containing an Intel Haswell multicore CPU, an Nvidia K40c GPU, and an Nvidia P100 GPU, and on simulations of clusters of such hybrid nodes. The experiments demonstrate that our proposed algorithms provide tremendous speedups over state-of-the-art solutions.
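
To make the notion of Pareto-optimal workload distributions concrete, here is a minimal brute-force sketch for two processors. It is not either of the algorithms proposed in the abstract: the function names, the two-processor restriction, and the toy profiles are assumptions for illustration. Each processor is described by discrete time and energy functions indexed by workload size; a distribution's execution time is the maximum of the processors' times and its energy is the sum of their energies.

```python
def candidate_points(n, time_fns, energy_fns):
    """Enumerate every split of workload n between two processors and
    return (time, energy, (x0, x1)) for each split."""
    t0, t1 = time_fns
    e0, e1 = energy_fns
    points = []
    for x in range(n + 1):
        time = max(t0[x], t1[n - x])     # processors run in parallel
        energy = e0[x] + e1[n - x]       # energy adds up across processors
        points.append((time, energy, (x, n - x)))
    return points


def pareto_front(points):
    """Keep the points not dominated in (time, energy), minimizing both.
    After sorting by time, a point survives only if it strictly improves
    on the lowest energy seen so far."""
    front, best_energy = [], float("inf")
    for time, energy, split in sorted(points):
        if energy < best_energy:
            front.append((time, energy, split))
            best_energy = energy
    return front


# Toy profiles (hypothetical): processor 0 is fast but power-hungry,
# processor 1 is slow but frugal. Index = workload size.
time_fns = ([0, 1, 2, 3, 4], [0, 2, 4, 6, 8])
energy_fns = ([0, 5, 10, 15, 20], [0, 1, 2, 3, 4])

front = pareto_front(candidate_points(4, time_fns, energy_fns))
for time, energy, split in front:
    print(split, time, energy)
```

With these toy profiles, giving all four units to processor 0 is dominated by the even split, which takes the same time at lower energy. The brute-force enumeration grows quickly with the number of processors, which is the cost the abstract's algorithms aim to reduce.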

Articles published on Data-parallel Applications

Implementation and Evaluation of SIMD Instructions using RISC-V

Block size estimation for data partitioning in HPC applications using machine learning techniques

Meeting Coflow Deadlines in Data Center Networks With Policy-Based Selective Completion

Reducing energy consumption using heterogeneous voltage frequency scaling of data-parallel applications for multicore systems

Acceleration of Bi-Objective Optimization of Data-Parallel Applications for Performance and Energy on Heterogeneous Hybrid Platforms

Bottleneck-Aware Non-Clairvoyant Coflow Scheduling With Fai

Combining stream with data parallelism abstractions for multi-cores

COX: Exposing CUDA Warp-level Functions to CPUs

Optimization of heterogeneous systems with AI planning heuristics and machine learning: a performance and energy aware approach

Bi-Objective Optimization of Data-Parallel Applications on Heterogeneous HPC Platforms for Performance and Energy Through Workload Distribution

HCE: A Runtime System for Efficiently Supporting Heterogeneous Cooperative Execution

Joint Coflow Optimization for Data Center Networks

PySchedCL: Leveraging Concurrency in Heterogeneous Data-Parallel Systems

Towards Optimal Matrix Partitioning for Data Parallel Computing on a Hybrid Heterogeneous Server

A RISC-V Simulator and Benchmark Suite for Designing and Evaluating Vector Architectures

Resource-Aware Device Allocation of Data-Parallel Applications on Heterogeneous Systems

SDAM: a combined stack distance-analytical modeling approach to estimate memory performance in GPUs

Dynamic Memory Bandwidth Allocation for Real-Time GPU-Based SoC Platforms

Snitch: A Tiny Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads

Scheduling Mix-Coflows in Datacenter Networks
