Abstract
The speed of distributed matrix computations over large clusters is often dominated by stragglers (slow or failed worker nodes). Several techniques based on coding theory have been introduced to mitigate the straggler issue, in which every worker node is assigned smaller task(s) of multiplying encoded submatrices of the original matrices. However, many of these methods treat the stragglers as erasures, i.e., they discard the potentially useful partial computations done by the slower workers. Moreover, the "input" matrices can be sparse in many scenarios; in this case, encoding schemes that combine a large number of input submatrices can adversely affect the worker computation time. In this work, we propose an integrated approach that addresses both of the issues mentioned above. We allow a limited amount of encoding for the submatrices of both A and B; this helps us preserve the sparsity of the encoded matrices, so that the worker computation can be fast. Our approach provides a trade-off between straggler resilience and worker computation speed, while utilizing partial computations at the workers. Crucially, at one operating point we can ensure that the failure resilience of the system is optimal. Comprehensive numerical experiments on an Amazon Web Services (AWS) cluster confirm the superiority of our approach compared with previous methods.
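To make the erasure-coding viewpoint that the abstract contrasts with concrete, the following is a minimal toy sketch (not the scheme proposed in this work): a simple (3, 2) MDS-style code in which A is split row-wise into two blocks and a third "parity" block A1 + A2 is added, so that the products from any 2 of the 3 workers suffice to recover A·B. All matrix sizes, helper names, and the pure-Python matrix routines are illustrative assumptions.

```python
# Toy (3,2) MDS-coded matrix multiplication: stragglers treated as erasures.
# A is split row-wise into A1, A2; three workers compute A1@B, A2@B, (A1+A2)@B.
# Any 2 of the 3 worker results are enough to decode A@B.

def mat_add(X, Y):
    # entrywise sum of two equal-sized matrices (lists of lists)
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mat_sub(X, Y):
    # entrywise difference of two equal-sized matrices
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mat_mul(X, Y):
    # standard matrix product of X (m x k) and Y (k x n)
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

A = [[1, 2], [3, 4], [5, 6], [7, 8]]   # 4 x 2 input matrix
B = [[1, 0], [0, 1]]                   # 2 x 2 input matrix

A1, A2 = A[:2], A[2:]                  # row-block split of A
tasks = [A1, A2, mat_add(A1, A2)]      # encoded submatrices, one per worker

# Suppose worker 0 (holding A1) straggles; decode from workers 1 and 2 only.
P2 = mat_mul(tasks[1], B)              # A2 @ B
P3 = mat_mul(tasks[2], B)              # (A1 + A2) @ B
P1 = mat_sub(P3, P2)                   # recovered A1 @ B

assert P1 + P2 == mat_mul(A, B)        # stacked blocks equal the full product
```

Note that this toy code encodes A alone and discards the straggler's partial work entirely, which is exactly the pair of limitations (dense combinations of many submatrices, and wasted partial computations) that the proposed approach is designed to avoid.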