PRM: An Efficient Partial Recovery Method to Accelerate Training Data Reconstruction for Distributed Deep Learning Applications in Cloud Storage Systems

Piao Hu, Ranhao Jia, Minyi Guo, Chentao Wu, Yunfei Gu, Jie Li

doi:10.1109/iwqos54832.2022.9812919

Abstract

Distributed deep learning is a typical machine learning method that runs in distributed environments such as cloud computing systems. The corresponding training, validation, and test datasets are generally very large (e.g., several TBs) and must be stored across multiple data nodes. Because disk failure rates in cloud storage systems are high, a critical issue for distributed deep learning is how to tolerate disk failures efficiently during training. These failures can cause a large amount of data loss, which decreases training accuracy and slows down the training process. Although several recovery methods have been proposed to accelerate data reconstruction, their overhead is extremely high, including high CPU/GPU utilization and a large number of I/Os. To address these problems, we propose a novel Partial-Recovery Method (PRM), an adaptive recovery method that accelerates data reconstruction for distributed deep learning applications in cloud storage systems. The key idea of PRM is to combine erasure coding’s ability to obtain global information on the data distribution with AI’s ability to recover part of the lost data, which sharply reduces the overhead while maintaining acceptable training accuracy. To demonstrate the effectiveness of PRM, we conduct several experiments. The results show that, compared to state-of-the-art full or approximate recovery methods, PRM decreases the average network transmission time overhead by up to 64.50% and reduces the recovery time by up to 55.90%.

Full Text