Abstract
In wireless distributed computing systems, worker nodes connect to a master node wirelessly and perform large-scale computational tasks that are parallelized across them. However, the common phenomenon of straggling (i.e., worker nodes often experience unpredictable slowdown during computation and communication) and packet losses due to severe channel fading can significantly increase the latency of computational tasks. In this paper, we consider a heterogeneous, wireless, distributed computing system performing large-scale matrix multiplications which form the core of many machine learning applications. To address the aforementioned challenges, we first propose a random linear network coding (RLNC) approach that leverages the linearity of matrix multiplication, which has many salient properties, including ratelessness, maximum straggler tolerance and near-ideal load balancing. We then theoretically demonstrate that its latency converges to the optimum in probability when the matrix size grows to infinity. To combat the high encoding and decoding overheads of the RLNC approach, we further propose a practical variation based on batched sparse (BATS) code. The effectiveness of our proposed approaches is demonstrated by numerical simulations.
Published Version (Free)
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have