Abstract
Due to limited GPU memory, the performance of large-DNN training is constrained by batch sizes that cannot scale. Existing research partially addresses the GPU memory limit through tensor recomputation and swapping, but overlooks the exploration of optimal performance. In response, we propose ATP, a recomputation- and swapping-based GPU memory management framework that aims to maximize training performance by breaking the GPU memory constraint. ATP uses a throughput model we propose to evaluate the theoretical peak performance achievable by DNN training on a GPU and to derive the optimal amount of memory required for recomputation and swapping. We optimize the mechanisms for the GPU memory pool and CUDA stream control, and employ an optimization method to search for the specific tensors that require recomputation or swapping, bringing the actual DNN training performance of ATP closer to the theoretical values. Evaluations with different types of large DNN models show that ATP achieves throughput improvements of 1.14∼1.49×, while supporting model training that exceeds the GPU memory limit by up to 9.2×.
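To make the swapping idea concrete, below is a minimal sketch, assuming PyTorch, of how a tensor can be offloaded to pinned host memory on a dedicated CUDA stream so the copy overlaps with compute, and then prefetched back before it is needed again. The function names and structure are hypothetical illustrations of the general technique, not ATP's actual implementation.

```python
import torch

# Illustrative sketch (not ATP's code): overlap tensor offload/prefetch with
# compute by issuing the copies on a separate CUDA stream.

swap_stream = torch.cuda.Stream()  # dedicated stream for host<->device copies

def offload(tensor):
    """Asynchronously copy `tensor` to pinned host memory on the swap stream."""
    host_buf = torch.empty(tensor.shape, dtype=tensor.dtype, pin_memory=True)
    # Ensure the copy starts only after the producer kernel has finished.
    swap_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(swap_stream):
        host_buf.copy_(tensor, non_blocking=True)
    return host_buf

def prefetch(host_buf):
    """Asynchronously bring a previously offloaded tensor back to the GPU."""
    with torch.cuda.stream(swap_stream):
        gpu_tensor = host_buf.to("cuda", non_blocking=True)
    # Make the compute stream wait until the prefetch has completed.
    torch.cuda.current_stream().wait_stream(swap_stream)
    return gpu_tensor

if __name__ == "__main__":
    act = torch.randn(1024, 1024, device="cuda")  # stand-in for an activation
    cpu_copy = offload(act)
    del act                                       # free GPU memory for other tensors
    act_again = prefetch(cpu_copy)                # restore before, e.g., backward
```

In a real framework the decision of which tensors to offload, and when to start the prefetch so the copy finishes just in time, is exactly the kind of search that ATP's throughput model and optimization method are described as guiding.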