Job Completion Research Articles

Recognizing the diversity of Big Data analytic jobs, cloud providers offer a wide range of VM instance types or even clusters to cater for different use cases. The choice of cloud configuration can have a significant impact on the response time and running cost of batch-processing applications, which may need to be re-run regularly with cloud-scale resources. However, identifying the best cloud configuration with a low search cost is challenging due to i) the large and high-dimensional configuration space, ii) the time-varying cloud service cost (e.g., AWS Spot instances), and iii) job response time variation even under the same configuration. To tackle these challenges, we design and implement Accordia, a system that enables Adaptive Cloud Configuration Optimization for Recurring Data-Intensive Applications. By leveraging recent algorithmic advances in Gaussian Process UCB techniques, Accordia can find the cost-optimal configuration subject to a deadline constraint (i.e., maximum tolerated running time) under the time-varying cloud service cost. More importantly, Accordia achieves a theoretical performance guarantee: sub-linearly increasing dynamic regret of the job completion cost. Using extensive trace-driven simulations and empirical measurements of our Kubernetes-based implementation, we demonstrate that Accordia can identify a near-cost-optimal configuration (i.e., within 10% of the optimum) after fewer than 20 runs from over 7000 candidate choices, which translates to a 2x speedup and up to 17.9% cost savings compared with the state-of-the-art approach, CherryPick.
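The search loop described in this abstract can be illustrated with a much simpler stand-in. The sketch below is not Accordia and does not use a Gaussian Process: it is a plain UCB-style bandit over a handful of discrete configurations that picks the lowest optimistic cost estimate among configurations whose observed mean runtime meets the deadline. The configuration names, costs, runtimes, noise levels, and the `beta` parameter are all hypothetical values chosen for illustration.

```python
import math
import random

# Hypothetical configurations: name -> (true mean cost, true mean runtime in seconds).
CONFIGS = {"small": (1.0, 120.0), "medium": (2.0, 80.0), "large": (4.0, 40.0)}


def pick_config(stats, total_runs, deadline, beta=2.0):
    """Pick the feasible config with the lowest optimistic cost estimate.

    stats maps config name -> list of (cost, runtime) observations.
    Untried configs are explored first; configs whose mean observed
    runtime exceeds the deadline are skipped. Returns None if every
    tried config looks infeasible.
    """
    best, best_score = None, math.inf
    for name, obs in stats.items():
        if not obs:
            return name  # try every config at least once
        mean_cost = sum(c for c, _ in obs) / len(obs)
        mean_time = sum(t for _, t in obs) / len(obs)
        if mean_time > deadline:
            continue  # violates the deadline constraint on average
        # Exploration bonus shrinks as a config is sampled more often.
        bonus = math.sqrt(beta * math.log(total_runs + 1) / len(obs))
        score = mean_cost - bonus  # lower confidence bound on cost
        if score < best_score:
            best, best_score = name, score
    return best


def simulate(rounds=200, deadline=100.0, seed=0):
    """Run the selection loop against noisy synthetic observations."""
    rng = random.Random(seed)
    stats = {name: [] for name in CONFIGS}
    for t in range(rounds):
        name = pick_config(stats, t, deadline)
        if name is None:
            break
        cost, runtime = CONFIGS[name]
        stats[name].append((rng.gauss(cost, 0.1), rng.gauss(runtime, 2.0)))
    return {name: len(obs) for name, obs in stats.items()}


if __name__ == "__main__":
    print(simulate())
```

In this toy setup the cheapest configuration misses the deadline, so the loop settles on the next cheapest one after a few exploratory runs. A Gaussian Process model would additionally share information between similar configurations, which is what makes searching thousands of candidates in a few runs plausible.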

With the rapid proliferation of Machine Learning (ML) and Deep Learning (DL) applications running on modern platforms, it is crucial to satisfy application performance requirements such as meeting deadlines and ensuring accuracy. To this end, researchers have proposed several job schedulers for ML clusters. However, none of the previously proposed schedulers considers ML model parallelism, even though it has been proposed as an approach to increase the efficiency of running large-scale ML and DL jobs. Thus, in this paper, we propose an ML job Feature based job Scheduling system (MLFS) for ML clusters running both data-parallel and model-parallel ML jobs. MLFS first uses a heuristic scheduling method that considers an ML job's spatial and temporal features to determine task priority for job queue ordering, in order to improve job completion time (JCT) and accuracy. It uses the data from the heuristic scheduling method to train a deep reinforcement learning (RL) model. Once the RL model is well trained, MLFS switches to the RL method to make job scheduling decisions automatically. In addition, MLFS has a system load control method that selects tasks from overloaded servers to move to underloaded servers based on task priority and, when the system is overloaded, removes tasks that generate little or no improvement in the desired accuracy, so as to improve JCT and the accuracy achieved by the job deadline. Furthermore, we propose an optimal ML iteration stopping method that determines the proper time to stop training an ML model, namely when the model reaches its minimum loss value. Our real experiments and large-scale simulation based on a real trace show that MLFS reduces JCT by up to 53% and makespan by up to 52%, and improves accuracy by up to 64% compared with existing ML job schedulers. We have also open-sourced our code.
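The abstract does not say how the iteration stopping method detects the minimum loss. As an illustration only, the sketch below uses a common patience-based early-stopping rule, which is a substitute technique and not necessarily what MLFS does. The function name `stop_iteration` and the parameters `patience` and `min_delta` are hypothetical.

```python
def stop_iteration(losses, patience=3, min_delta=0.0):
    """Return the iteration index at which training would stop.

    Training stops once the loss has failed to improve on the best
    value seen so far by more than min_delta for `patience`
    consecutive iterations. Returns None if the rule never fires.
    """
    best = float("inf")
    stale = 0  # consecutive iterations without improvement
    for i, loss in enumerate(losses):
        if loss < best - min_delta:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return None


if __name__ == "__main__":
    history = [1.0, 0.8, 0.6, 0.61, 0.62, 0.63, 0.5]
    print(stop_iteration(history))
```

A rule like this trades a few extra iterations (the patience window) for robustness against noisy loss values; a scheduler could use the freed resources for other queued jobs.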

Articles published on Job Completion

Towards Accelerating Data Intensive Application's Shuffle Process Using SmartNICs

12 Exposure to Metals and Particles in Welding and Episodes of Asthma/Wheeze and Rhinitis: a Canadian Cohort Study.

An Algebraic Approach to the Solutions of the Open Shop Scheduling Problem

Cloud Configuration Optimization for Recurring Batch-Processing Applications

Online scheduling of coflows by attention-empowered scalable deep reinforcement learning

DFS: Joint data formatting and sparsification for efficient communication in Distributed Machine Learning

Multi-Stage Geo-Distributed Data Aggregation With Coordinated Computation and Communication in Edge Compute First Networking

The Emergence of Public Sector Innovation Associated with Civil Servants' Perception in 3T Regions: Results of a Multiple Regression Analysis

Online Scheduling Algorithm for Heterogeneous Distributed Machine Learning Jobs

Elastic Resource Management for Deep Learning Applications in a Container Cluster

Deep Learning-Based Job Placement in Distributed Machine Learning Clusters With Heterogeneous Workloads

Time-Aware Data Partition Optimization and Heterogeneous Task Scheduling Strategies in Spark Clusters

On the use of intelligent metasurfaces in data centers

Increased Task Execution with a Bandwidth-Aware Hadoop Scheduling Approach

Machine Learning Feature Based Job Scheduling for Distributed Machine Learning Clusters

Software-defined networking enabled big data tasks scheduling: A tabu search approach

Distributed job allocation using response threshold for heterogeneous robot team under deadline constraints

Design of deep learning model for radio resource allocation in 5G for massive iot device

MPU-6050 Wheeled Robot Controlled Hand Gesture Using L298N Driver Based on Arduino

On preemptive scheduling on unrelated machines using linear programming
