Abstract
Sparsification of neural networks is an effective complexity reduction method for improving efficiency and generalizability. Binarized activation offers additional computational savings at inference. Because of the vanishing gradient issue in training networks with binarized activation, a coarse gradient (a.k.a. the straight-through estimator) is adopted in practice. In this paper, we study coarse gradient descent (CGD) learning of a one-hidden-layer convolutional neural network (CNN) with binarized activation function and sparse weights. It is known that when the input data are Gaussian distributed, a no-overlap one-hidden-layer CNN with ReLU activation and general weights can be learned by GD in polynomial time with high probability in regression problems with ground truth. We propose a relaxed variable splitting method integrating thresholding and coarse gradient descent. Sparsity in the network weights is realized through thresholding during the CGD training process. We prove that under thresholding of the l_1, l_0, and transformed-l_1 penalties, a no-overlap binary activation CNN can be learned with high probability, and the iterative weights converge to a global limit which is a transformation of the true weights under a novel sparsifying operation. We also derive explicit error estimates of the sparse weights from the true weights.
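As a rough illustration of the coarse-gradient idea described in the abstract, the sketch below is our own simplified Python example, not code from the paper; the function names and the clipped-identity surrogate are illustrative assumptions. It replaces the almost-everywhere-zero derivative of the binary activation with a nonzero proxy in the backward pass, in the spirit of the straight-through estimator.

```python
import numpy as np

def binary_activation(z):
    # Forward pass: binarized activation, 1 if the pre-activation is positive, else 0.
    return (z > 0).astype(float)

def coarse_activation_grad(z):
    # Backward pass: the true derivative is zero almost everywhere, so a coarse
    # (straight-through style) surrogate is used instead -- here the derivative
    # of a clipped identity, equal to 1 on |z| <= 1 and 0 otherwise.
    return (np.abs(z) <= 1.0).astype(float)

# Toy usage: one filter w applied to a single Gaussian input patch x,
# with a squared-error regression loss against a scalar label y.
rng = np.random.default_rng(0)
x = rng.normal(size=5)      # Gaussian input, matching the distributional assumption
w = rng.normal(size=5)      # current filter weights
y = 1.0                     # ground-truth response

z = x @ w
residual = binary_activation(z) - y                        # d(loss)/d(activation) for 0.5*(a - y)^2
coarse_grad_w = residual * coarse_activation_grad(z) * x   # coarse gradient w.r.t. w
w = w - 0.1 * coarse_grad_w                                # one coarse gradient descent step
```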
Highlights
Deep neural networks (DNNs) have achieved state-of-the-art performance on many machine learning tasks such as speech recognition [1], computer vision [2], and natural language processing [3].
We propose a Relaxed Variable Splitting (RVS) approach combining thresholding and coarse gradient descent (CGD) for minimizing the augmented objective function L_β(u, w) = f(w) + λ P(u) + (β/2)‖w − u‖^2, where β is a positive parameter.
We shall prove that our algorithm (RVSCGD), alternately minimizing over u and w, converges for the l_0, l_1, and transformed-l_1 (Tl_1) penalties to a global limit (w, u) with high probability; a minimal illustrative sketch of one alternating update is given below.
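The following sketch is our own illustrative Python under simplifying assumptions (the exact thresholding maps, step sizes, and stopping rules in the paper may differ): the u-update applies the proximal (thresholding) map of (λ/β)P to the current weights, and the w-update takes one coarse gradient step on f(w) + (β/2)‖w − u‖^2.

```python
import numpy as np

def soft_threshold(w, t):
    # Proximal map of t * ||.||_1: componentwise soft-thresholding.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def hard_threshold(w, t):
    # Proximal map of t * ||.||_0: keep entries with |w_i| > sqrt(2 t), zero out the rest.
    return np.where(np.abs(w) > np.sqrt(2.0 * t), w, 0.0)

def rvscgd_step(w, coarse_grad_f, lam, beta, lr, penalty="l1"):
    """One alternating update of the relaxed variable splitting scheme (illustrative).

    u-update: u = prox_{(lam/beta) P}(w), i.e. thresholding of the current weights.
    w-update: one coarse gradient descent step on f(w) + (beta/2) * ||w - u||^2.
    """
    u = soft_threshold(w, lam / beta) if penalty == "l1" else hard_threshold(w, lam / beta)
    w = w - lr * (coarse_grad_f(w) + beta * (w - u))
    return w, u
```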
Summary
Deep neural networks (DNNs) have achieved state-of-the-art performance on many machine learning tasks such as speech recognition [1], computer vision [2], and natural language processing [3]. Several works [6, 7] focused on the geometric properties of loss functions, an analysis made possible by assuming that the input data distribution is Gaussian. They showed that SGD with random or zero initialization can train a no-overlap neural network in polynomial time. A surrogate l_0 regularization approach, based on a continuous relaxation of Bernoulli random variables in the distributional sense, was introduced with encouraging results on small image data sets [13]. This motivated our work here to study deterministic regularization of l_0 via its Moreau envelope and related l_1 penalties in a one-hidden-layer convolutional neural network model [7]. As pointed out in Louizos et al. [13], it is beneficial to attain sparsity during the optimization (training) process.
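For concreteness, the Moreau-type envelope of the l_0 penalty mentioned above admits a closed form; the calculation below is standard and stated in our own notation (λ, β as in the augmented objective), which may differ from the paper's.

```latex
% Componentwise minimization: either u_i = w_i (cost lambda) or u_i = 0 (cost (beta/2) w_i^2).
\min_{u}\Big\{\lambda\|u\|_0+\tfrac{\beta}{2}\|w-u\|^2\Big\}
  =\sum_i \min\!\Big(\lambda,\ \tfrac{\beta}{2}\,w_i^2\Big),
\qquad
u_i^\ast=\begin{cases} w_i, & |w_i|>\sqrt{2\lambda/\beta},\\ 0, & \text{otherwise},\end{cases}
```

so the minimizer is exactly the hard-thresholding map used in the sparsifying u-update.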