Abstract

Modern big data applications tend to prefer a cluster computing approach, as they rely on distributed computing frameworks that serve user jobs on demand. Such a framework processes jobs rapidly by subdividing each job into tasks that execute in parallel. Because of the complex environment and hardware and software issues, some tasks may run slowly and delay job completion; such slow tasks are known as stragglers. Straggling nodes bottleneck the performance of a distributed computing framework due to factors such as shared resources, heavy system load, or hardware issues, prolonging job execution time. Many state-of-the-art approaches train independent models per node and workload. As the numbers of nodes and workloads grow, the number of models grows accordingly, and even with many nodes, not every node can capture stragglers, since sufficient training data on straggler patterns may be unavailable, yielding suboptimal straggler prediction. To alleviate these problems, we propose a novel collaborative learning-based approach for straggler prediction based on the alternating direction method of multipliers (ADMM), which is resource-efficient and learns to mitigate stragglers without moving data to a centralized location. The proposed framework shares information among the various models, allowing us to use larger training data and reduce training time by avoiding data transfer. We rigorously evaluate the proposed method on various datasets and obtain highly accurate results.

Highlights

  • Any organization that depends on a cloud computing environment majorly focuses on factors like CPU usage, memory, I/O, and network for performance optimization

  • All these parameters are susceptible to performance degradation and may result in suboptimal quality of service (QoS). The Google cluster’s trace study is a milestone toward the analysis of workloads in a cloud environment with multiple servers, as studied in Dean and Ghemawat [1]; Chen et al. [2]; Reiss et al. [3]. This provides the analysis of workload data recorded on the Google cluster trace. The important contribution is the analysis of many tasks and jobs, which offers an efficient allotment of resources for new tasks arriving at the cloud data center, thereby increasing the throughput of the data center

  • In this paper, we propose a Collaborative Learning-based (CL) formulation for learning predictors that are highly accurate and generalize better than multiple independent models. This is based on the alternating direction method of multipliers (ADMM)-based support vector machine (SVM) proposed by Boyd et al. [13]. The proposed model enables the nodes to collectively learn a shared prediction model while keeping all the training data on the nodes, decoupling the ability to do ML from the need to store the data in a centralized manner
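The consensus idea behind the highlights can be sketched in a few lines. This is an illustrative toy, not the paper's method: it uses a least-squares local loss instead of the SVM hinge loss of Boyd et al. [13], and the function name, the two-way data split, and the parameter values are assumptions for demonstration. Each node keeps its data (A_i, b_i) local and exchanges only model vectors with the consensus step.

```python
import numpy as np

def consensus_admm(A_parts, b_parts, rho=1.0, iters=500):
    """Toy consensus ADMM: each (A_i, b_i) stays on its node; only the
    local models x_i, duals u_i, and consensus z are ever exchanged."""
    d = A_parts[0].shape[1]
    n = len(A_parts)
    x = [np.zeros(d) for _ in range(n)]
    u = [np.zeros(d) for _ in range(n)]
    z = np.zeros(d)
    for _ in range(iters):
        for i in range(n):
            # Local update: each node solves a small regularized
            # least-squares problem pulling x_i toward the consensus z.
            lhs = A_parts[i].T @ A_parts[i] + rho * np.eye(d)
            rhs = A_parts[i].T @ b_parts[i] + rho * (z - u[i])
            x[i] = np.linalg.solve(lhs, rhs)
        # Consensus update: average the (shifted) local models.
        z = np.mean([x[i] + u[i] for i in range(n)], axis=0)
        # Dual update: penalize disagreement with the consensus.
        for i in range(n):
            u[i] += x[i] - z
    return z
```

The same decomposition applies with the hinge loss; only the local solve changes, while the consensus and dual updates are identical, which is what lets the nodes learn a shared predictor without centralizing the data.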


Introduction

Any organization that depends on a cloud computing environment majorly focuses on factors like CPU usage, memory, I/O, and network for performance optimization. All these parameters are susceptible to performance degradation and may result in suboptimal quality of service (QoS). Cloud computing and high-performance computing frameworks typically monitor task completion status and launch backup tasks for stragglers during job execution. Such redundant approaches incur huge operational and financial costs. Even so, they do not provide post-event analyses to diagnose the causes of stragglers or prevent them proactively. Reactive techniques typically compare a task's execution time against a threshold calculated from the median value across all the tasks [4].
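The reactive median-threshold criterion described above can be sketched as follows; the 1.5× factor and the function name are illustrative assumptions, not values taken from [4].

```python
# Minimal sketch of a reactive straggler criterion: flag a task when its
# running time exceeds a multiple of the median task duration in the job.
from statistics import median

def flag_stragglers(durations, factor=1.5):
    """Return indices of tasks whose duration exceeds factor * median."""
    threshold = factor * median(durations)
    return [i for i, d in enumerate(durations) if d > threshold]
```

For example, among tasks with durations [10, 11, 9, 30], the median is 10.5, so only the 30-second task exceeds the 15.75 threshold and is flagged as a straggler; a speculative backup copy would then be launched for it.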

