Abstract
Artificial neural network training with gradient descent can be destabilized by 'bad batches' with high losses. This is often problematic for training with small batch sizes, high-order loss functions or unstably high learning rates. To stabilize learning, we have developed adaptive learning rate clipping (ALRC) to limit backpropagated losses to a number of standard deviations above their running means. ALRC is designed to complement existing learning algorithms: our algorithm is computationally inexpensive, can be applied to any loss function or batch size, is robust to hyperparameter choices and does not affect backpropagated gradient distributions. Experiments with CIFAR-10 supersampling show that ALRC decreases errors for unstable mean quartic error training, while stable mean squared error training is unaffected. We also show that ALRC decreases unstable mean squared errors for scanning transmission electron microscopy supersampling and partial scan completion. Our source code is available at https://github.com/Jeffrey-Ede/ALRC.
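As an illustration of the clipping rule described above, the sketch below keeps exponential moving averages of the first two raw moments of the loss and rescales any loss that exceeds a threshold of n standard deviations above the running mean before backpropagation. It is a minimal sketch rather than the reference implementation in the linked repository: the PyTorch framing, the hyperparameter names (n, decay) and the initial moment estimates are assumptions made for illustration.

import torch

class ALRC:
    """Adaptive learning rate clipping sketch: losses more than n standard
    deviations above their running mean are scaled down before
    backpropagation, which damps the gradients of 'bad batches'."""

    def __init__(self, n=3.0, decay=0.999, init_mean=25.0, init_mean_sq=900.0):
        self.n = n               # number of standard deviations
        self.decay = decay       # exponential moving average decay
        self.mu1 = init_mean     # running mean of the loss
        self.mu2 = init_mean_sq  # running mean of the squared loss

    def __call__(self, loss):
        raw = float(loss.detach())

        # Clipping threshold: n standard deviations above the running mean.
        sigma = max(self.mu2 - self.mu1 ** 2, 0.0) ** 0.5
        threshold = self.mu1 + self.n * sigma

        # Update the running moments with the clipped loss value so a
        # single spike cannot drag the statistics upwards.
        clipped = min(raw, threshold)
        self.mu1 = self.decay * self.mu1 + (1.0 - self.decay) * clipped
        self.mu2 = self.decay * self.mu2 + (1.0 - self.decay) * clipped ** 2

        # Rescale spiked losses; the scale factor is a detached constant,
        # so gradient directions are unchanged and only their size shrinks.
        if raw > threshold:
            loss = loss * (threshold / raw)
        return loss

In a training loop the wrapper would be applied to the scalar loss just before calling backward(); the model, optimizer and batch size are left unchanged.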
Highlights
This paper addresses loss spikes, one of the most common causes of poor performance in artificial neural networks trained with stochastic gradient descent (SGD) [1].
Adaptive learning rate clipping (ALRC) has no effect on mean squared error (MSE) training, even for batch size 1.
Taken together, our CIFAR-10 supersampling results show that ALRC improves stability and lowers losses for learning that would be destabilized by loss spikes, and otherwise has little effect.
Summary
This paper addresses loss spikes, one of the most common causes of poor performance in artificial neural networks trained with stochastic gradient descent (SGD) [1]. Gradients backpropagated from high losses can excessively perturb trainable parameter distributions and destabilize learning. An example of loss spikes destabilizing learning is shown in fig. 1. Loss spikes are common for small batch sizes, high-order loss functions and unstably high learning rates. During neural network training with vanilla SGD, a trainable parameter, θt, from step t is updated to θt+1 at step t + 1. The size of the update is the product of a learning rate, η, and the backpropagated gradient of a loss function, L, with respect to the trainable parameter: θt+1 = θt − η ∂L/∂θt.
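To make the effect of a loss spike on this update concrete, the toy example below applies the vanilla SGD rule to a single parameter with a quartic error; the data, learning rate and loss function are illustrative choices, not taken from the paper. The final 'bad batch' produces a loss several orders of magnitude above the others, so the resulting parameter step is enormous and throws the parameter far from its previous value.

# Vanilla SGD for a single weight theta with prediction theta * x and a
# quartic error; high-order losses make loss spikes especially severe.
def quartic_loss(pred, target):
    return (pred - target) ** 4

def quartic_grad(pred, target):
    return 4.0 * (pred - target) ** 3

eta = 0.01     # learning rate
theta = 1.0    # trainable parameter

# The last (x, y) pair plays the role of a 'bad batch'.
for x, y in [(1.0, 1.2), (1.0, 0.9), (10.0, -5.0)]:
    pred = theta * x
    grad = quartic_grad(pred, y) * x   # chain rule through pred = theta * x
    step = eta * grad                  # size of the SGD update
    print(f"loss = {quartic_loss(pred, y):.3g}, step = {step:.3g}")
    theta = theta - step

Clipping the spiked loss to a few standard deviations above the running mean of the earlier losses would shrink that final step by several orders of magnitude, which is the behaviour ALRC automates.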