Abstract
Artificial neural network training with gradient descent can be destabilized by 'bad batches' with high losses. This is often problematic for training with small batch sizes, high-order loss functions or unstably high learning rates. To stabilize learning, we have developed adaptive learning rate clipping (ALRC) to limit backpropagated losses to a number of standard deviations above their running means. ALRC is designed to complement existing learning algorithms: our algorithm is computationally inexpensive, can be applied to any loss function or batch size, is robust to hyperparameter choices and does not affect backpropagated gradient distributions. Experiments with CIFAR-10 supersampling show that ALRC decreases errors for unstable mean quartic error training, while stable mean squared error training is unaffected. We also show that ALRC decreases unstable mean squared errors for scanning transmission electron microscopy supersampling and partial scan completion. Our source code is available at https://github.com/Jeffrey-Ede/ALRC.
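As an illustration of the clipping rule described above, the sketch below keeps exponential moving averages of the first two raw moments of the loss and rescales any loss that exceeds a threshold of n standard deviations above the running mean before backpropagation. It is a minimal sketch rather than the reference implementation in the linked repository: the PyTorch framing, the hyperparameter names (n, decay) and the initial moment estimates are assumptions made for illustration.

import torch

class ALRC:
    """Adaptive learning rate clipping sketch: losses more than n standard
    deviations above their running mean are scaled down before
    backpropagation, which damps the gradients of 'bad batches'."""

    def __init__(self, n=3.0, decay=0.999, init_mean=25.0, init_mean_sq=900.0):
        self.n = n               # number of standard deviations
        self.decay = decay       # exponential moving average decay
        self.mu1 = init_mean     # running mean of the loss
        self.mu2 = init_mean_sq  # running mean of the squared loss

    def __call__(self, loss):
        raw = float(loss.detach())

        # Clipping threshold: n standard deviations above the running mean.
        sigma = max(self.mu2 - self.mu1 ** 2, 0.0) ** 0.5
        threshold = self.mu1 + self.n * sigma

        # Update the running moments with the clipped loss value so a
        # single spike cannot drag the statistics upwards.
        clipped = min(raw, threshold)
        self.mu1 = self.decay * self.mu1 + (1.0 - self.decay) * clipped
        self.mu2 = self.decay * self.mu2 + (1.0 - self.decay) * clipped ** 2

        # Rescale spiked losses; the scale factor is a detached constant,
        # so gradient directions are unchanged and only their size shrinks.
        if raw > threshold:
            loss = loss * (threshold / raw)
        return loss

In a training loop the wrapper would be applied to the scalar loss just before calling backward(); the model, optimizer and batch size are left unchanged.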
Highlights
This paper addresses loss spikes, one of the most common causes of poor performance in artificial neural networks trained with stochastic gradient descent (SGD) [1].
Adaptive learning rate clipping (ALRC) has no effect on mean squared error (MSE) training, even for batch size 1.
Taken together, our CIFAR-10 supersampling results show that ALRC improves stability and lowers losses for learning that would be destabilized by loss spikes, and otherwise has little effect.
Summary
This paper addresses loss spikes, one of the most common causes of poor performance in artificial neural networks trained with stochastic gradient descent (SGD) [1]. Gradients backpropagated from high losses can excessively perturb trainable parameter distributions and destabilize learning. An example of loss spikes destabilizing learning is shown in fig. 1. Loss spikes are common for small batch sizes, high-order loss functions and unstably high learning rates. During neural network training with vanilla SGD, a trainable parameter, θt, from step t is updated to θt+1 at step t + 1. The size of the update is the product of a learning rate, η, and the backpropagated gradient of a loss function, L, with respect to the trainable parameter: θt+1 = θt − η ∂L/∂θt.
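To make the effect of a loss spike on this update concrete, the toy example below applies the vanilla SGD rule to a single parameter with a quartic error; the data, learning rate and loss function are illustrative choices, not taken from the paper. The final 'bad batch' produces a loss several orders of magnitude above the others, so the resulting parameter step is enormous and throws the parameter far from its previous value.

# Vanilla SGD for a single weight theta with prediction theta * x and a
# quartic error; high-order losses make loss spikes especially severe.
def quartic_loss(pred, target):
    return (pred - target) ** 4

def quartic_grad(pred, target):
    return 4.0 * (pred - target) ** 3

eta = 0.01     # learning rate
theta = 1.0    # trainable parameter

# The last (x, y) pair plays the role of a 'bad batch'.
for x, y in [(1.0, 1.2), (1.0, 0.9), (10.0, -5.0)]:
    pred = theta * x
    grad = quartic_grad(pred, y) * x   # chain rule through pred = theta * x
    step = eta * grad                  # size of the SGD update
    print(f"loss = {quartic_loss(pred, y):.3g}, step = {step:.3g}")
    theta = theta - step

Clipping the spiked loss to a few standard deviations above the running mean of the earlier losses would shrink that final step by several orders of magnitude, which is the behaviour ALRC automates.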