ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management

Weiduo Chen,Xiaoshe Dong,Fan Zhang,Bowen Li,Yufei Wang,Qiang Wang

doi:10.1145/3701996

Weiduo Chen, Xiaoshe Dong + Show 4 more

Open Access

PDF Available

https://doi.org/10.1145/3701996

Copy DOI

Export

Save

Cite

Abstract
Full-Text PDF
Similar Papers

Abstract

Listen

Due to the limited GPU memory, the performance of large DNNs training is constrained by the unscalable batch size. Existing researches partially address the issue of GPU memory limit through tensor recomputation and swapping, but overlook the exploration of optimal performance. In response, we propose ATP, a recomputation and swapping based GPU memory management framework that aims to maximize training performance by breaking GPU memory constraints. ATP utilizes a throughput model we proposed to evaluate the theoretical peak performance achievable by DNN training on GPU, and provide the optimum memory size required for recomputation and swapping. We optimize the mechanisms for GPU memory pool and CUDA stream control, employs an optimization method to search for specific tensors requiring recomputation and swapping, thereby bringing the actual DNN training performance on ATP closer to theoretical values. Evaluations with different types of large DNN models indicate that ATP achieve throughput improvements ranging from 1.14 ∼ 1.49 ×, while support model training exceeding the GPU memory limit by up to 9.2 ×.

Full Text