Abstract

Training deep neural networks (DNNs) concurrently is becoming increasingly important for deep learning practitioners, e.g., in hyperparameter optimization (HPO) and neural architecture search (NAS). GPU memory capacity is the impediment that prevents multiple DNNs from being trained on the same GPU, owing to the large memory footprint of training. In this paper, we propose Waterwave, a GPU memory flow engine for concurrent deep learning training. First, to address the memory explosion caused by the long lag between memory allocation and deallocation, we develop an allocator tailored for multiple streams. By making the allocator aware of stream information, prioritized allocation is performed based on each chunk's synchronization attributes, allowing usable memory to be provided right after scheduling rather than waiting for it to actually be released after GPU computation. Second, Waterwave partitions the compute graph into a set of contiguous node groups and then performs finer-grained scheduling, NodeGroup pipeline execution, to guarantee a proper ordering of memory requests. Waterwave achieves up to 96.8% of the maximum batch size of solo training. Additionally, in scenarios with high memory demand, Waterwave outperforms existing spatial sharing and temporal sharing by up to 12x and 1.49x, respectively.
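To make the prioritized-allocation idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation) of a stream-aware caching allocator: freed chunks carry the id of the releasing stream, so a request from the same stream can reuse them immediately (stream ordering guarantees the prior computation finished), while cross-stream requests may only take chunks whose release is known to be synchronized, falling back to fresh memory otherwise. All class and method names here are illustrative assumptions.

```python
class StreamAwareAllocator:
    """Toy model of a stream-aware caching allocator (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity   # total bytes available
        self.used = 0              # bytes currently handed out
        self.free_chunks = []      # cached (size, releasing_stream, synced) entries

    def free(self, size, stream, synced=False):
        """Return a chunk to the cache; synced=True models a chunk whose
        releasing GPU work is known (e.g., via an event) to have completed."""
        self.used -= size
        self.free_chunks.append((size, stream, synced))

    def alloc(self, size, stream):
        """Prioritized allocation: same-stream chunks first, then fully
        synchronized chunks, then fresh memory."""
        # 1st priority: chunk released by the requesting stream itself;
        # in-stream ordering makes immediate reuse safe.
        for i, (sz, st, _synced) in enumerate(self.free_chunks):
            if sz >= size and st == stream:
                del self.free_chunks[i]
                self.used += size
                return "reuse-same-stream"
        # 2nd priority: chunk from another stream whose release has
        # already been synchronized.
        for i, (sz, _st, synced) in enumerate(self.free_chunks):
            if sz >= size and synced:
                del self.free_chunks[i]
                self.used += size
                return "reuse-synced"
        # Otherwise fall back to fresh memory if capacity allows.
        if self.used + size <= self.capacity:
            self.used += size
            return "fresh"
        raise MemoryError("out of memory")
```

In this sketch, an unsynchronized chunk freed on stream 0 is invisible to an allocation on stream 1, which models why stream awareness lets memory "flow" back to the releasing stream without waiting for a device-wide synchronization.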
