Efficient task replication for fast response times in parallel computation

Da Wang, Gauri Joshi, Gregory Wornell

doi:10.1145/2637364.2592042

Abstract

Large-scale distributed computing systems divide a job into many independent tasks and run them in parallel on different machines. A challenge in such parallel computing is that the time taken by a machine to execute a task is inherently variable, and thus the slowest machine becomes the bottleneck in the completion of the job. One way to combat the variability in machine response is to replicate tasks on multiple machines and wait for the machine that finishes first. While task replication reduces response time, it generally increases resource usage. In this work, we propose a theoretical framework to analyze the trade-off between response time and resource usage. Given an execution time distribution for machines, our analysis gives insights into when and why replication helps. We also propose efficient scheduling algorithms for large-scale distributed computing systems.
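The trade-off described in the abstract can be illustrated with a small sketch. It is not the paper's framework or scheduling algorithm. It only computes two quantities for one job: the response time (the slowest task, where each task finishes when its fastest replica does) and the total machine time. The function name `job_metrics`, the cost model (all replicas of a task are assumed to start together and to be cancelled when the first one finishes), and the Pareto execution times in the demo are assumptions made for illustration.

```python
import random


def job_metrics(times, replicas):
    """Return (response_time, machine_time) for one job.

    times[i][j] is the execution time of replica j of task i.
    Only the first `replicas` entries of each row are used.
    Assumption: replicas of a task start together and the slower
    ones are cancelled as soon as the fastest one finishes.
    """
    response_time = 0.0
    machine_time = 0.0
    for task_times in times:
        # A task completes when its fastest replica completes.
        first_done = min(task_times[:replicas])
        # The job completes when its slowest task completes.
        response_time = max(response_time, first_done)
        # Every replica occupies a machine until the first one is done.
        machine_time += replicas * first_done
    return response_time, machine_time


if __name__ == "__main__":
    rng = random.Random(0)
    num_tasks, max_replicas = 100, 4
    # Hypothetical heavy-tailed execution times (Pareto, shape 2.0).
    times = [[rng.paretovariate(2.0) for _ in range(max_replicas)]
             for _ in range(num_tasks)]
    for r in range(1, max_replicas + 1):
        latency, cost = job_metrics(times, r)
        print(f"replicas={r} response_time={latency:.2f} machine_time={cost:.2f}")
```

Because the same sampled times are reused for every replication level, the response time in this sketch can only stay the same or decrease as replicas are added. Whether the machine time rises or falls depends on the execution time distribution, which is the question the abstract says the analysis addresses.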

Full Text