Straggler Mitigation at Scale

Mehmet Fatih Aktas, Emina Soljanin

doi:10.1109/tnet.2019.2946464

Abstract

Runtime performance variability has been a major issue, hindering predictable and scalable performance in modern distributed systems. Executing requests or jobs redundantly over multiple servers has been shown to be effective for mitigating variability, both in theory and in practice. Systems that employ redundancy have drawn significant attention, and numerous papers have analyzed the pain and gain of redundancy under various service models and assumptions on the runtime variability. This paper presents a cost (pain) vs. latency (gain) analysis of executing jobs of many tasks by employing replicated or erasure-coded redundancy. The tail heaviness of service time variability is decisive for the pain and gain of redundancy, and we quantify its effect by deriving expressions for cost and latency. Specifically, we try to answer four questions: 1) How do replicated and coded redundancy compare in the cost vs. latency tradeoff? 2) Can we introduce redundancy after waiting some time and expect it to reduce the cost? 3) Can relaunching the tasks that appear to be straggling after some time help to reduce cost and/or latency? 4) Is it effective to use redundancy and relaunching together? We validate the answers we found for each of these questions via simulations that use empirical distributions extracted from Google cluster data.
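
To make the replicated vs. coded comparison in question 1 concrete, the following is a minimal Monte Carlo sketch and not the paper's method. It assumes Pareto-distributed task service times as one example of a heavy tail, assumes leftover tasks are cancelled as soon as they are no longer needed, and counts cost as total server time. The function names (`pareto`, `replicated_job`, `coded_job`, `simulate`) and all parameter values are hypothetical choices made for illustration.

```python
import random


def pareto(rng, scale, alpha):
    # Inverse-CDF sample with P(X > x) = (scale / x) ** alpha for x >= scale.
    # rng.random() is in [0, 1), so the denominator is never zero.
    return scale / (1.0 - rng.random()) ** (1.0 / alpha)


def replicated_job(rng, k, c, scale, alpha):
    # Assumption: each of the k tasks runs as c + 1 copies; a task is done
    # when its fastest copy finishes and its other copies are cancelled then.
    task_times = [
        min(pareto(rng, scale, alpha) for _ in range(c + 1)) for _ in range(k)
    ]
    latency = max(task_times)                     # job waits for slowest task
    cost = sum((c + 1) * t for t in task_times)   # server time over all copies
    return latency, cost


def coded_job(rng, k, n, scale, alpha):
    # Assumption: n coded tasks are launched and any k completions finish
    # the job; the n - k unfinished tasks are cancelled at that moment.
    times = sorted(pareto(rng, scale, alpha) for _ in range(n))
    latency = times[k - 1]                        # k-th fastest completion
    cost = sum(times[:k]) + (n - k) * latency     # finished + cancelled tasks
    return latency, cost


def simulate(job_fn, trials, seed, **params):
    # Average latency and cost of job_fn over independent trials.
    rng = random.Random(seed)
    total_latency = 0.0
    total_cost = 0.0
    for _ in range(trials):
        latency, cost = job_fn(rng, **params)
        total_latency += latency
        total_cost += cost
    return total_latency / trials, total_cost / trials


if __name__ == "__main__":
    common = dict(trials=5000, seed=7, k=10, scale=1.0, alpha=2.0)
    print("no redundancy:", simulate(coded_job, n=10, **common))
    print("replicated   :", simulate(replicated_job, c=1, **common))
    print("coded        :", simulate(coded_job, n=20, **common))
```

In this sketch the replicated and coded runs both use 20 tasks per job, so their average latency and cost can be compared at equal resource usage. Changing `alpha` varies the tail heaviness of the assumed distribution.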

Journal: IEEE/ACM Transactions on Networking
Publication Date: Dec 1, 2019
Citations: 77

Similar Papers

Straggler Mitigation by Delayed Relaunch of Tasks
Mehmet Fatih Aktas ... Emina Soljanin
ACM SIGMETRICS Performance Evaluation Review | VOL. 45
20 Mar 2018

Effective Straggler Mitigation
Mehmet Fatih Aktas ... Emina Soljanin
ACM SIGMETRICS Performance Evaluation Review | VOL. 45
11 Oct 2017

Co-optimization of a Multi-Energy Microgrid Considering Multiple Services
H Wang ... E A Martinez Cesena
01 Jun 2018

Parallel application-level behavioral attributes for performance and energy management of high-performance computing systems
Jeffrey J Evans ... Charles E Lucas
Cluster Computing | VOL. 16
17 Dec 2011
