Abstract

The proliferation of large neural network architectures, particularly deep learning models, makes training increasingly resource-intensive. GPU memory constraints have become a notable bottleneck in training such sizable models. Existing strategies, including data parallelism, model parallelism, pipeline parallelism, and fully sharded data parallelism, offer partial solutions. Model parallelism, in particular, enables the distribution of the entire model across multiple GPUs, yet the ensuing data communication between these partitions slows down training. Instead of training the entire model end to end, this study advocates partitioning the model across GPUs and generating synthetic intermediate labels to train the individual segments. These labels, produced through a random process, mitigate memory overhead and computational load. This approach yields a more efficient training process that minimizes data communication while maintaining model accuracy. The method is validated using 6-layer fully connected networks on the Extended MNIST, CIFAR-10, and CIFAR-100 datasets. It is shown that the computational improvement in reaching 90% of the cross-yield accuracy can be as high as 66%. Additionally, the improvement in training bandwidth compared to standard model parallelism is quantitatively demonstrated through an example scenario. This work contributes to mitigating the resource-intensive nature of training large neural networks, paving the way for more efficient deep learning model development.
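To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of segment-wise training with synthetic intermediate labels, not the authors' implementation. It assumes a two-segment split of a fully connected network, one fixed random vector per class as the synthetic intermediate label, an assumed intermediate width (`intermediate_dim`), an MSE loss for the first segment, and a cross-entropy loss for the second; the abstract does not specify these details. The point it illustrates is that neither segment needs the other's activations or gradients during training, which is what removes the inter-GPU communication of standard model parallelism.

```python
# Hypothetical sketch (not the paper's released code): split a fully connected
# network into two segments and train them independently against randomly
# generated "synthetic intermediate labels".
import torch
import torch.nn as nn

torch.manual_seed(0)

num_classes = 10
input_dim = 28 * 28       # e.g. flattened MNIST-style inputs
intermediate_dim = 64     # assumed width of the synthetic intermediate labels

# One fixed random vector per class acts as that class's synthetic label.
synthetic_labels = torch.randn(num_classes, intermediate_dim)

# Segment 1 (e.g. hosted on GPU 0): input -> synthetic intermediate label.
segment1 = nn.Sequential(
    nn.Linear(input_dim, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, intermediate_dim),
)

# Segment 2 (e.g. hosted on GPU 1): synthetic intermediate label -> class logits.
segment2 = nn.Sequential(
    nn.Linear(intermediate_dim, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, num_classes),
)

opt1 = torch.optim.Adam(segment1.parameters(), lr=1e-3)
opt2 = torch.optim.Adam(segment2.parameters(), lr=1e-3)


def train_step(x, y):
    """x: (B, input_dim) inputs, y: (B,) integer class labels."""
    target_intermediate = synthetic_labels[y]  # (B, intermediate_dim)

    # Segment 1 regresses onto the synthetic labels of the true classes;
    # it never needs gradients or activations from segment 2.
    opt1.zero_grad()
    loss1 = nn.functional.mse_loss(segment1(x), target_intermediate)
    loss1.backward()
    opt1.step()

    # Segment 2 maps synthetic labels to class targets, so it does not need
    # segment 1's activations either.
    opt2.zero_grad()
    loss2 = nn.functional.cross_entropy(segment2(target_intermediate), y)
    loss2.backward()
    opt2.step()
    return loss1.item(), loss2.item()


# Toy usage with random data standing in for a real data loader.
x = torch.randn(32, input_dim)
y = torch.randint(0, num_classes, (32,))
print(train_step(x, y))

# At inference time the segments are simply chained: segment2(segment1(x)).
```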