Abstract

The success of deep neural networks, especially convolutional neural networks, across various applications has largely been enabled by an enormous number of learnable parameters. These parameters increase the learning capacity of the model, but they also significantly increase its computational and memory costs, which severely hinders the deployment of such models in resource-constrained environments like IoT devices. The majority of network weights are known to be redundant and can be removed from the network. This paper introduces a regularization scheme that combines structured sparsity regularization with variance regularization: it produces computationally sparse models by driving the majority of parameter groups to zero, while increasing the variance of the non-zero groups to compensate for the resulting accuracy drop. We use sparse group lasso, the group-sparsity variant of $\ell_1$ (lasso) regularization, to remove redundant connections and unnecessary neurons from the network. For variance regularization, we minimize the KL divergence between the current parameter distribution and a target distribution, which concentrates weights around zero while keeping a high variance among the non-zero weights (a skewed distribution). To evaluate the effectiveness of the proposed regularizer, experiments are performed on various benchmark datasets, and we observe that variance regularization reduces the accuracy drop caused by sparsity regularization. On MNIST, the number of trainable parameters is reduced from 331,984 (baseline model) to 57,327 while achieving better accuracy than the baseline (99.6%). On Fashion-MNIST, CIFAR-10, and ImageNet, the proposed scheme achieves state-of-the-art sparsity with almost no drop in accuracy.
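As a concrete illustration of the structured sparsity term, the sketch below shows how a sparse group lasso penalty could be computed in PyTorch. This is a minimal sketch under assumptions the abstract does not fix: the groups are taken to be the output channels/neurons of each convolutional and linear layer, and the coefficient names `lam_l1` and `lam_group` are hypothetical. The KL-based variance term is not shown, since the abstract does not specify the target distribution.

```python
import torch
import torch.nn as nn

def sparse_group_lasso(model: nn.Module,
                       lam_l1: float = 1e-4,      # hypothetical coefficient for the l1 term
                       lam_group: float = 1e-4):  # hypothetical coefficient for the group term
    """Sparse group lasso penalty: an l1 term on individual weights plus a
    group (l2,1) term, with one group per output channel / neuron (assumed grouping)."""
    l1_term = 0.0
    group_term = 0.0
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            w = module.weight
            l1_term = l1_term + w.abs().sum()
            # Flatten to (num_groups, group_size): one group per output channel.
            groups = w.view(w.size(0), -1)
            # Standard sqrt(group size) scaling from the group lasso literature.
            group_term = group_term + (groups.size(1) ** 0.5) * groups.norm(p=2, dim=1).sum()
    return lam_l1 * l1_term + lam_group * group_term
```

During training, the penalty would simply be added to the task loss, e.g. `loss = criterion(logits, targets) + sparse_group_lasso(model)`; groups whose l2 norm is driven to zero correspond to prunable channels or neurons.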
