Abstract

As neural networks grow deeper and datasets grow larger, training becomes increasingly difficult. When GPU memory is insufficient, it is challenging to train deeper models. Recent research combines tensor swapping and recomputation to optimize memory usage. However, the complex dependencies and enormous scale of the DNN computation graph limit the gains of single-GPU memory optimization, and improper swap and recomputation decisions can even degrade training performance. In this paper, we propose a novel hybrid tensor re-generation strategy, called STR, which combines swapping and recomputation to find the optimal execution plan for DNN training when memory is limited. We formalize the memory optimization problem with constraints that describe operator computation dependencies and swap bandwidth usage. A host checkpoint mechanism is designed to make full use of swapped tensors, which reduces the cost of recomputation. We also present a recursive source tracing algorithm that improves optimization efficiency through constraint relaxation with a performance bound. To optimize large models, we further introduce an approximation method based on weighted graph coarsening. We implement a prototype of STR as a plugin for TensorFlow and evaluate it on 5 popular DNN models. The experimental results show that the approximate solution of STR improves the training throughput of the ResNet series of models by up to 28.1% compared to the state-of-the-art hybrid optimization strategy.
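To make the swap-versus-recompute trade-off at the heart of this abstract concrete, the sketch below greedily decides, per tensor, whether to keep it resident on the GPU, swap it to host memory, or recompute it later, under a fixed memory budget. This is only a minimal illustration: the tensor names, bandwidth and budget figures, and the greedy eviction policy are assumptions for exposition, not STR's actual constraint-based formulation described in the paper.

    # Illustrative sketch (not STR's algorithm): per-tensor keep / swap / recompute
    # decisions under a GPU memory budget. All numbers and names are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Tensor:
        name: str
        size_mb: float        # memory footprint on the GPU
        recompute_ms: float   # cost to re-generate the tensor by recomputation

    PCIE_BANDWIDTH_MB_PER_MS = 12.0   # assumed host<->GPU transfer bandwidth
    MEMORY_BUDGET_MB = 512.0          # assumed GPU memory budget for activations

    def regeneration_cost(t: Tensor) -> tuple[str, float]:
        """Return the cheaper re-generation choice ('swap' or 'recompute') and its cost."""
        swap_ms = t.size_mb / PCIE_BANDWIDTH_MB_PER_MS   # swap-in time ~ size / bandwidth
        return ("swap", swap_ms) if swap_ms <= t.recompute_ms else ("recompute", t.recompute_ms)

    def plan(tensors: list[Tensor]) -> dict[str, str]:
        """Greedy sketch: release the cheapest-to-regenerate memory first
        until the resident set fits within the budget."""
        decisions = {t.name: "keep" for t in tensors}
        resident_mb = sum(t.size_mb for t in tensors)
        by_cheapness = sorted(tensors, key=lambda t: regeneration_cost(t)[1] / t.size_mb)
        for t in by_cheapness:
            if resident_mb <= MEMORY_BUDGET_MB:
                break
            decisions[t.name] = regeneration_cost(t)[0]
            resident_mb -= t.size_mb
        return decisions

    if __name__ == "__main__":
        toy = [
            Tensor("conv1_out", 256.0, recompute_ms=4.0),
            Tensor("conv2_out", 192.0, recompute_ms=30.0),
            Tensor("fc_out", 128.0, recompute_ms=1.5),
            Tensor("attn_out", 160.0, recompute_ms=50.0),
        ]
        print(plan(toy))

A full solution such as STR must additionally respect operator dependency constraints and shared swap bandwidth, which is what motivates the constraint formulation and the host checkpoint and recursive source tracing mechanisms summarized above.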
