Robust learning of parsimonious deep neural networks

Valentin Frank Ingmar Guenter, Athanasios Sideris

doi:10.1016/j.neucom.2023.127011

Abstract

We propose a simultaneous learning and pruning algorithm capable of identifying and eliminating irrelevant structures in a neural network during the early stages of training. Simultaneous learning and pruning presents serious challenges, such as premature pruning of units, which can lead to poor performance. Our method overcomes these challenges and is robust, i.e., it gives consistent pruning levels and prediction accuracy regardless of weight initialization or the size of the starting network. Thus, it allows for substantial computational cost savings during training, in addition to those at inference, and it can enable the training of very deep networks in cases where a fully trained network to be pruned after training cannot be obtained through transfer learning. Our approach is based on variational inference principles using Gaussian scale mixture priors on the neural network weights. The variational posterior distribution of Bernoulli random variables multiplying the units/filters is learned, similarly to adaptive dropout. We construct a novel hyper-prior distribution over the prior parameters to impose properties crucial for their optimal selection and for the overall robustness of our algorithm. We show that, in the context of our algorithm, the parameters of the posterior distributions practically converge to either 0 or 1, establishing a deterministic final network. Convergence is proved analytically using dynamical systems theory, and practical pruning conditions are derived from the theoretical results. The proposed algorithm is evaluated on the MNIST, CIFAR-10 and ImageNet data sets and on the commonly used fully connected, convolutional and residual architectures LeNet, VGG16 and ResNet. The simulations show that our method typically achieves better pruning levels while maintaining test accuracy on par with state-of-the-art methods for structured pruning, in a manner robust with respect to network initialization and initial size.

Full Text