Abstract

Linear combination (LC) based coded distributed computing (CDC) suffers from poor numerical stability. Consequently, LC-CDC based model parallel (MP) training of a deep neural network (DNN) may have poor accuracy. To enhance accuracy, we propose to replace LC with shift-and-addition (SA) in the encoding process of each layer and matrix inversion with zigzag decoding (ZD) in the decoding process, and call the scheme Naive SAZD-CDC based MP training (N-SAZD-CDC-MP-T). However, N-SAZD-CDC-MP-T suffers from a bottleneck at the master node, caused by frequent encoding/decoding there and by frequent large-volume data transfers between the master and worker nodes. This bottleneck can slow down training significantly. To alleviate it, we further design an enhanced version that offloads part of the master node's processing to distributed encoding and updating (DEU) at the worker nodes, and call it DEU-SAZD-CDC-MP-T. A proof is provided that DEU-SAZD-CDC-MP-T automatically maintains the code structure in each iteration. Extensive numerical studies show that the prediction accuracy of SAZD-CDC-MP-T improves significantly over that of a Poly-based scheme (representative of LC). In addition, DEU-SAZD-CDC-MP-T achieves a significantly higher training speed than N-SAZD-CDC-MP-T.
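To make the SA/ZD idea concrete, the following is a minimal sketch for two data blocks and two parities: shift-and-addition encoding uses only element shifts and additions, and zigzag decoding recovers the blocks by peeling symbols one at a time, with no matrix inversion. The block contents, shift pattern, and function names here are illustrative assumptions for a toy (4, 2) example, not the paper's exact construction.

```python
import numpy as np

def shift(x, s, length):
    """Delay x by s positions and zero-pad it to `length` symbols."""
    out = np.zeros(length, dtype=x.dtype)
    out[s:s + len(x)] = x
    return out

def sa_encode(x1, x2):
    """Encode two length-n blocks into two parities using only shifts and adds."""
    n = len(x1)
    p1 = shift(x1, 0, n + 1) + shift(x2, 0, n + 1)  # p1[j] = x1[j] + x2[j]
    p2 = shift(x1, 0, n + 1) + shift(x2, 1, n + 1)  # x2 delayed by one symbol
    return p1, p2

def zigzag_decode(p1, p2, n):
    """Recover both blocks from the parities by peeling symbols alternately."""
    x1 = np.zeros(n, dtype=p1.dtype)
    x2 = np.zeros(n, dtype=p1.dtype)
    for j in range(n):
        # At position j, p2 sees x1[j] plus the already-recovered x2[j-1].
        x1[j] = p2[j] - (x2[j - 1] if j > 0 else 0)
        # At position j, p1 sees x1[j] + x2[j], and x1[j] is now known.
        x2[j] = p1[j] - x1[j]
    return x1, x2

x1 = np.array([3, 1, 4, 1], dtype=np.int64)
x2 = np.array([5, 9, 2, 6], dtype=np.int64)
p1, p2 = sa_encode(x1, x2)
r1, r2 = zigzag_decode(p1, p2, len(x1))
assert np.array_equal(r1, x1) and np.array_equal(r2, x2)
```

Because the decoder only subtracts already-recovered symbols, the arithmetic is exact and avoids the ill-conditioned inversions that LC-based decoding requires, which is the numerical-stability advantage the abstract refers to.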
