Abstract

Distributed deep learning is a common machine learning approach that runs in distributed environments such as cloud computing systems. The corresponding training, validation, and test datasets are generally very large (e.g., several TBs) and need to be stored across multiple data nodes. Due to the high disk failure rate in cloud storage systems, a critical issue for distributed deep learning is how to efficiently tolerate disk failures during training. These failures can cause a large amount of data loss, which decreases training accuracy and slows down the training process. Although several recovery methods have been proposed to accelerate data reconstruction, their overhead is extremely high, including high CPU/GPU utilization and a large number of I/Os. To address these problems, we propose a novel Partial-Recovery Method (PRM), an adaptive recovery method that accelerates data reconstruction for distributed deep learning applications in cloud storage systems. The key idea of PRM is to combine erasure coding’s ability to obtain global information on the data distribution with AI’s ability to partially recover lost data, which can sharply reduce the overhead while maintaining acceptable training accuracy. To demonstrate the effectiveness of PRM, we conduct several experiments. The results show that, compared to state-of-the-art full or approximate recovery methods, PRM decreases the average network transmission time overhead by up to 64.50% and reduces the recovery time by up to 55.90%.

