Abstract
The speed of distributed matrix computations over large clusters is often dominated by stragglers (slow or failed worker nodes). Several techniques based on coding theory have been introduced to mitigate the straggler issue, in which every worker node is assigned smaller task(s) of multiplying encoded submatrices of the original matrices. However, many of these methods treat the stragglers as erasures, i.e., they discard the potentially useful partial computations done by the slower workers. Moreover, the "input" matrices can be sparse in many scenarios; in this case, encoding schemes that combine a large number of input submatrices can adversely affect the worker computation time. In this work, we propose an integrated approach that addresses both of the issues mentioned above. We allow a limited amount of encoding for the submatrices of both A and B; this helps us preserve the sparsity of the encoded matrices, so that the worker computation can be fast. Our approach provides a trade-off between straggler resilience and worker computation speed, while utilizing partial computations at the workers. Crucially, at one operating point we can ensure that the failure resilience of the system is optimal. Comprehensive numerical experiments on an Amazon Web Services (AWS) cluster confirm the superiority of our approach compared with previous methods.
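To make the erasure-coding viewpoint that the abstract contrasts with concrete, the following is a minimal toy sketch (not the scheme proposed in this work): a simple (3, 2) MDS-style code in which A is split row-wise into two blocks and a third "parity" block A1 + A2 is added, so that the products from any 2 of the 3 workers suffice to recover A·B. All matrix sizes, helper names, and the pure-Python matrix routines are illustrative assumptions.

```python
# Toy (3,2) MDS-coded matrix multiplication: stragglers treated as erasures.
# A is split row-wise into A1, A2; three workers compute A1@B, A2@B, (A1+A2)@B.
# Any 2 of the 3 worker results are enough to decode A@B.

def mat_add(X, Y):
    # entrywise sum of two equal-sized matrices (lists of lists)
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mat_sub(X, Y):
    # entrywise difference of two equal-sized matrices
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mat_mul(X, Y):
    # standard matrix product of X (m x k) and Y (k x n)
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

A = [[1, 2], [3, 4], [5, 6], [7, 8]]   # 4 x 2 input matrix
B = [[1, 0], [0, 1]]                   # 2 x 2 input matrix

A1, A2 = A[:2], A[2:]                  # row-block split of A
tasks = [A1, A2, mat_add(A1, A2)]      # encoded submatrices, one per worker

# Suppose worker 0 (holding A1) straggles; decode from workers 1 and 2 only.
P2 = mat_mul(tasks[1], B)              # A2 @ B
P3 = mat_mul(tasks[2], B)              # (A1 + A2) @ B
P1 = mat_sub(P3, P2)                   # recovered A1 @ B

assert P1 + P2 == mat_mul(A, B)        # stacked blocks equal the full product
```

Note that this toy code encodes A alone and discards the straggler's partial work entirely, which is exactly the pair of limitations (dense combinations of many submatrices, and wasted partial computations) that the proposed approach is designed to avoid.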