Abstract

Graphics Processing Units (GPUs) have limited memory capacity. Training popular deep neural networks (DNNs) often requires more memory than a GPU provides, so training data must be swapped between CPU and GPU memory. Data swapping becomes a bottleneck when its latency exceeds the latency of DNN computations. Compressing tensors on the GPU can reduce data swapping time. However, existing work on compressing tensors in the virtual memory of GPUs has two major issues: sub-optimal compression performance under varying tensor sparsity and sizes, and a lack of portability, because it requires additional (de)compression units in the memory controllers. We propose a self-tuning tensor compression framework, named CSWAP, for improving the virtual memory management of GPUs. It is highly portable and depends minimally on GPU architectural features. Furthermore, its runtime applies compression only to tensors for which compression is cost-effective, given their sparsity and size and the characteristics of the compression algorithms. Finally, our framework is fully automated and customizes the compression policy for different neural network architectures and GPU architectures. Experimental results with six representative memory-intensive DNN models show that CSWAP reduces tensor swapping latency by up to 50.9% and reduces DNN training time by 20.7% on average on NVIDIA V100 GPUs, compared with vDNN.
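The selective-compression idea can be pictured with a simple cost model: compress a tensor before swapping it out only when on-GPU compression plus the transfer of the compressed data is expected to beat transferring the raw data. The sketch below is purely illustrative; the bandwidth and throughput constants, the zero-value sparsity heuristic, and the function names are assumptions for illustration and do not reproduce CSWAP's actual policy.

```python
# Illustrative sketch of a cost-model check for selective tensor compression
# before a GPU-to-CPU swap. All constants and heuristics here are assumed.

import numpy as np

PCIE_BANDWIDTH_GBPS = 12.0        # assumed effective CPU-GPU transfer rate (GB/s)
COMPRESS_THROUGHPUT_GBPS = 40.0   # assumed on-GPU compression throughput (GB/s)

def estimate_compression_ratio(tensor: np.ndarray) -> float:
    """Rough ratio estimate from zero-value sparsity (hypothetical heuristic)."""
    sparsity = float(np.count_nonzero(tensor == 0)) / tensor.size
    # Assume a sparsity-aware encoder removes most zeros but adds index overhead.
    return max(1.0, 1.0 / max(1e-3, (1.0 - sparsity) + 0.05))

def should_compress(tensor: np.ndarray) -> bool:
    """Compress only if compression + compressed transfer beats a raw transfer."""
    size_gb = tensor.nbytes / 1e9
    ratio = estimate_compression_ratio(tensor)
    raw_swap_time = size_gb / PCIE_BANDWIDTH_GBPS
    compressed_swap_time = (size_gb / COMPRESS_THROUGHPUT_GBPS        # compress
                            + size_gb / ratio / PCIE_BANDWIDTH_GBPS)  # transfer
    return compressed_swap_time < raw_swap_time

if __name__ == "__main__":
    t = np.zeros((4096, 4096), dtype=np.float32)
    t[:64, :64] = np.random.rand(64, 64)   # a highly sparse activation tensor
    print("compress before swap?", should_compress(t))
```

A tuning layer could then refine the estimated ratios and throughputs per tensor and per GPU at runtime, which is the spirit of the self-tuning policy described in the abstract.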
