To fit increasingly large models into limited GPU memory, various coarse-grained techniques, such as recomputation and swapping, have been proposed to optimize memory usage. However, these methods suffer from either insufficient memory reduction or degraded training performance. In response, this paper introduces DELTA, a memory-efficient approach to large-scale model training that combines fine-grained memory optimization with prefetching to reduce memory usage while maintaining high training throughput. We first formulate the joint memory-throughput optimization as a 0/1 knapsack problem. Building on this formulation, we solve it effectively with an improved polynomial-time heuristic algorithm. We further introduce a novel bidirectional prefetching scheme into dynamic memory management, which significantly accelerates training compared to relying solely on recomputation or swapping. Finally, DELTA provides users with an automated training execution library, eliminating the need for manual configuration or specialized expertise. Experimental results demonstrate the effectiveness of DELTA in reducing GPU memory consumption: compared to state-of-the-art methods, DELTA achieves memory savings of 40% to 72% while maintaining comparable convergence for various models, including ResNet-50, ResNet-101, and BERT-Large. Notably, DELTA enables training GPT2-Large and GPT2-XL with batch sizes increased by 5.5× and 6×, respectively, showcasing its versatility and practicality for large-scale model training on GPU hardware.
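The abstract describes formulating eviction planning as a 0/1 knapsack problem solved by a polynomial-time heuristic, but does not specify the cost model. The minimal Python sketch below illustrates one plausible greedy ratio heuristic under assumed inputs; the names `Candidate`, `mem_saved`, `time_cost`, and `plan_evictions` are hypothetical and do not reflect DELTA's actual implementation.

```python
# Illustrative sketch only: assumes each candidate tensor, if evicted
# (recomputed or swapped out), frees `mem_saved` bytes at a `time_cost`
# overhead, and evictions are chosen greedily by savings-per-cost until a
# target memory reduction is met. The real DELTA heuristic may differ.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str        # tensor identifier (hypothetical)
    mem_saved: int   # bytes freed if this tensor is evicted
    time_cost: float # estimated overhead (seconds) to restore it later

def plan_evictions(candidates, mem_target):
    """Greedy 0/1 knapsack-style heuristic: prefer candidates that free the
    most memory per unit of overhead until `mem_target` bytes are freed."""
    ranked = sorted(candidates, key=lambda c: c.mem_saved / c.time_cost, reverse=True)
    chosen, freed, cost = [], 0, 0.0
    for c in ranked:
        if freed >= mem_target:
            break
        chosen.append(c)
        freed += c.mem_saved
        cost += c.time_cost
    return chosen, freed, cost

# Example usage with made-up numbers.
plan, freed, cost = plan_evictions(
    [Candidate("act_layer3", 512 << 20, 0.8),
     Candidate("act_layer7", 256 << 20, 0.1),
     Candidate("act_layer9", 128 << 20, 0.5)],
    mem_target=600 << 20,
)
print([c.name for c in plan], freed, cost)
```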