To address the challenges of partitioning complexity, high memory usage, long training time, and low device utilization in the parallel optimization of large-scale deep neural network (DNN) models, this paper proposes an asynchronous parallel optimization method, APapo. First, a multi-iteration asynchronous pipeline-parallel schedule is established for model-parallel computing tasks, controlling the scheduling of micro-batch units to address delayed gradient updates during asynchronous iterations. Second, based on the given network model and hardware configuration, a dynamic programming strategy for computing resources and model tasks is designed to achieve dynamic partitioning of model computing tasks and optimal matching to computing resources. Finally, an optimization strategy for runtime scheduling of computing resources and model tasks is developed, using improved device streams to maximize the overlap between computation and communication, thereby raising the utilization of computing resources and reducing training time. Experimental results show that APapo achieves fine-grained task partitioning, maximizes the utilization of each GPU's computing resources, and improves the training speed of large-scale DNN models by an average factor of 2.8 over existing parallel optimization methods while maintaining model training accuracy.
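To illustrate the kind of dynamic programming partitioning the abstract describes, the following is a minimal sketch, not the paper's actual formulation: it assumes per-layer compute costs are already profiled and uses a balanced-bottleneck objective (minimize the slowest pipeline stage) to split contiguous layers across a given number of GPUs. All names (`partition_layers`, `layer_costs`) and the cost model are hypothetical.

```python
# Minimal sketch (not APapo's exact algorithm): contiguous-layer partitioning
# via dynamic programming, assuming profiled per-layer compute costs and a
# balanced-bottleneck objective (minimize the slowest pipeline stage).
from functools import lru_cache
from typing import List, Tuple


def partition_layers(costs: List[float], num_gpus: int) -> Tuple[float, List[int]]:
    """Split per-layer costs into `num_gpus` contiguous stages, minimizing the
    maximum per-stage cost. Returns the optimal bottleneck cost and the layer
    indices at which each new stage begins."""
    n = len(costs)
    prefix = [0.0]
    for c in costs:
        prefix.append(prefix[-1] + c)

    @lru_cache(maxsize=None)
    def best(i: int, k: int) -> float:
        # Best achievable bottleneck for layers i..n-1 split into k stages.
        if k == 1:
            return prefix[n] - prefix[i]
        result = float("inf")
        for j in range(i + 1, n - k + 2):       # first stage = layers i..j-1
            stage = prefix[j] - prefix[i]
            result = min(result, max(stage, best(j, k - 1)))
        return result

    # Recover the stage boundaries by re-tracing the DP decisions.
    splits, i = [], 0
    for k in range(num_gpus, 1, -1):
        for j in range(i + 1, n - k + 2):
            stage = prefix[j] - prefix[i]
            if max(stage, best(j, k - 1)) == best(i, k):
                splits.append(j)
                i = j
                break
    return best(0, num_gpus), splits


if __name__ == "__main__":
    layer_costs = [4.0, 2.0, 3.0, 7.0, 1.0, 5.0, 2.0]   # hypothetical profile
    bottleneck, cuts = partition_layers(layer_costs, num_gpus=3)
    print(f"bottleneck stage cost: {bottleneck}, new stages start at layers {cuts}")
```

In practice such a partitioner would also fold memory limits and inter-stage communication volume into the stage cost; the sketch keeps only compute time to show the dynamic programming structure.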