Abstract

This paper demonstrates a novel approach to training deep neural networks using a Mutual Information (MI)-driven, decaying Learning Rate (LR), Stochastic Gradient Descent (SGD) algorithm. MI between the output of the neural network and the true outcomes is used to adaptively set the LR of the network at every epoch of the training cycle. This idea is extended to layer-wise setting of the LR, since MI naturally provides a layer-wise performance metric. An LR range test for determining the operating LR range is also proposed. Experiments compared this approach with popular gradient-based adaptive LR alternatives such as Adam, RMSprop, and LARS. Comparable or better accuracy, obtained in comparable or better time, demonstrates the feasibility of the metric and approach.
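A minimal sketch of how such an MI-driven LR update could look is given below. This is an illustration only, not the authors' exact algorithm: the function names, the use of scikit-learn's mutual_info_score as the MI estimator, and the linear mapping from MI to LR are all assumptions.

```python
# Hypothetical sketch of MI-driven LR decay in an SGD training loop.
import numpy as np
import torch
from sklearn.metrics import mutual_info_score


def mi_between_predictions_and_labels(model, loader, device="cpu"):
    """Estimate MI (in nats) between predicted class labels and true labels."""
    preds, labels = [], []
    model.eval()
    with torch.no_grad():
        for x, y in loader:
            logits = model(x.to(device))
            preds.append(logits.argmax(dim=1).cpu().numpy())
            labels.append(y.numpy())
    return mutual_info_score(np.concatenate(labels), np.concatenate(preds))


def mi_adapted_lr(lr_max, lr_min, mi_value, mi_max):
    """Map the current MI estimate to an LR in [lr_min, lr_max]:
    low MI (early training) -> large LR; MI near its ceiling -> small LR."""
    frac = min(mi_value / mi_max, 1.0)  # fraction of achievable MI captured so far
    return lr_min + (lr_max - lr_min) * (1.0 - frac)


# Usage inside a training loop (one LR update per epoch); for balanced classes
# the MI ceiling is the label entropy, log(num_classes) in nats:
#
# for epoch in range(num_epochs):
#     train_one_epoch(model, optimizer, train_loader)
#     mi = mi_between_predictions_and_labels(model, val_loader)
#     for group in optimizer.param_groups:
#         group["lr"] = mi_adapted_lr(1e-1, 1e-4, mi, mi_max=np.log(num_classes))
```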

Highlights

  • Automated Machine Learning (AutoML) systems with Deep Neural Network (DNN) models are currently a very active research area [1] and a key development goal being pursued by several major industry organizations, e.g., IBM, Google, Microsoft, etc.

  • This paper explores the feasibility of using Mutual Information (MI) [5] as a metric to realize this objective

  • This paper explores the use of MI-based metrics to automate the Learning Rate (LR) decay in Stochastic Gradient Descent (SGD) training of deep neural networks


Summary

Introduction

Automated Machine Learning (AutoML) systems with Deep Neural Network (DNN) models are currently a very active research area [1] and a key development goal being pursued by several major industry organizations, e.g., IBM, Google, Microsoft, etc. Among the key problems that need to be addressed towards this goal is hyperparameter selection and adaptation through the training process. Hyperparameter selection in DNNs is mostly done by experimentation for different data sets and models. In AutoML systems [1], this is realized through various forms of search, including grid search, random search, Bayesian optimization, etc. Stochastic Gradient Descent (SGD) optimization [2] with mini-batches of data is a time-tested and efficient approach to optimizing the weights of a DNN. Hyperparameter selection and adaptation have a strong bearing on the outcomes of SGD-based training of DNN models. Established procedures that set the LR to a low value at the beginning and gradually warm it up to the desired LR have been used effectively [3,4].
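For concreteness, a common linear warm-up schedule of the kind cited in [3,4] can be written as follows; the parameter values here are placeholders, not values taken from the cited works.

```python
# Minimal sketch of linear LR warm-up (warm-up length and target LR are assumptions).
def warmup_lr(step, warmup_steps, base_lr, start_lr=1e-6):
    """Linearly ramp the LR from start_lr to base_lr over warmup_steps, then hold it."""
    if step >= warmup_steps:
        return base_lr
    return start_lr + (base_lr - start_lr) * step / warmup_steps


# e.g. warmup_lr(0, 500, 0.1) -> 1e-6;  warmup_lr(500, 500, 0.1) -> 0.1
```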

