Abstract

The aim of this paper is to provide new theoretical and computational understanding of two loss regularizations employed in deep learning, known as local entropy and heat regularization. For both regularized losses, we introduce variational characterizations that naturally suggest a two-step scheme for their optimization, based on the iterative shift of a probability density and the calculation of a best Gaussian approximation in Kullback–Leibler divergence. Disregarding the approximation error in these two steps, the variational characterizations allow us to show a simple monotonicity result for the training error along optimization iterates. The two-step optimization schemes for the local entropy and heat regularized losses differ only in which argument of the Kullback–Leibler divergence is used to find the best Gaussian approximation. Local entropy corresponds to minimizing over the second argument, and the solution is given by moment matching. This allows replacing the traditional backpropagation calculation of gradients by sampling algorithms, opening an avenue for gradient-free, parallelizable training of neural networks. However, our presentation also acknowledges that naive optimization of regularized losses may increase the computational cost, giving a less optimistic view than existing works of the gains afforded by loss regularization.
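The computational fact underlying the abstract, that minimizing the Kullback–Leibler divergence KL(p || q) over a Gaussian q (the second argument) reduces to matching the mean and variance of p, can be checked numerically. The following is a minimal sketch of our own (the mixture target and the Monte Carlo check are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from a non-Gaussian target p: a two-component Gaussian mixture.
n = 200_000
comp = rng.random(n) < 0.3
x = np.where(comp, rng.normal(-2.0, 0.5, n), rng.normal(1.0, 1.0, n))

# Best Gaussian approximation q of p in KL(p || q): match mean and variance.
mu, var = x.mean(), x.var()

def mc_cross_entropy(samples, m, v):
    """Monte Carlo estimate of E_p[-log q] for q = N(m, v), i.e., the part of
    KL(p || q) that depends on q (the entropy of p is a constant)."""
    return 0.5 * np.mean((samples - m) ** 2 / v + np.log(2 * np.pi * v))

# The moment-matched Gaussian attains the smallest cross-entropy term, hence
# the smallest KL(p || q) among all Gaussians; shifting the mean must worsen it.
matched = mc_cross_entropy(x, mu, var)
perturbed = mc_cross_entropy(x, mu + 0.5, var)
assert matched < perturbed
```

Note that minimizing over the first argument, KL(q || p), would instead give a mode-seeking approximation; this asymmetry is exactly what distinguishes the two regularizations discussed in the paper.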

Highlights

  • The development and assessment of optimization methods for the training of deep neural networks has brought forward novel questions that call for new theoretical insights and computational techniques [1]

  • While the use of importance sampling opens an avenue for gradient-free, parallelizable training of neural networks, our numerical experiments will show that naive implementation without parallelization gives poor performance relative to stochastic gradient Langevin dynamics (SGLD) or plain stochastic gradient descent (SGD)

  • Information theory in deep learning: We have introduced information theoretic, variational characterizations of two loss regularizations that have received recent attention in the deep learning community
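The importance-sampling highlight above can be illustrated concretely: for the local entropy regularization, the gradient of the regularized loss takes the form (θ − E[w]) / τ, where the expectation is over a Gibbs-type measure proportional to exp(−L(w)) N(w; θ, τI). A self-normalized importance-sampling estimate of E[w], with the Gaussian N(θ, τI) as proposal, therefore yields a training step with no backpropagation. The quadratic toy loss, step size, and sample count below are our own assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def loss(w):
    # Toy quadratic loss 0.5 * ||w||^2, standing in for a network's
    # training loss (an assumption made for this illustration).
    return 0.5 * np.sum(w ** 2, axis=-1)

def local_entropy_grad(theta, tau, n_samples=5000):
    """Importance-sampling estimate of the local-entropy gradient
    (theta - E[w]) / tau, where w ~ exp(-loss(w)) N(w; theta, tau I),
    using N(theta, tau I) as the proposal -- no gradients of the loss."""
    w = theta + np.sqrt(tau) * rng.standard_normal((n_samples, theta.size))
    logw = -loss(w)
    logw -= logw.max()              # stabilize the exponentials
    weights = np.exp(logw)
    weights /= weights.sum()        # self-normalized importance weights
    mean_w = weights @ w            # estimate of E[w] under the Gibbs measure
    return (theta - mean_w) / tau

theta = np.full(5, 3.0)
for _ in range(50):
    theta -= 0.5 * local_entropy_grad(theta, tau=0.1)
# loss(theta) is now far below its initial value of 22.5
```

Each of the n_samples loss evaluations is independent, which is the source of the parallelizability noted in the highlight; the poor serial performance reported in the experiments reflects the cost of those many evaluations per step.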


Summary

Introduction

The development and assessment of optimization methods for the training of deep neural networks has brought forward novel questions that call for new theoretical insights and computational techniques [1]. Our third contribution is to perform a numerical case study assessing the performance of various implementations of the two-step iterative optimization of the local entropy and heat regularized functionals. These implementations differ in how the Kullback–Leibler minimization is computed and in which argument of the divergence is minimized. They suggest that for moderate-sized architectures, where the best Gaussian approximations in Kullback–Leibler divergence can be computed effectively, the generalization error with regularized losses is more stable than for stochastic gradient descent over the original loss.
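One building block used by these implementations is stochastic gradient Langevin dynamics (SGLD), which samples from a density proportional to exp(−L) by adding Gaussian noise to gradient steps. A minimal, self-contained sketch on a toy quadratic loss (the loss, step size, and chain length are our own assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def grad_loss(w):
    # Gradient of the toy quadratic loss 0.5 * ||w||^2; in practice this
    # would be a minibatch gradient of the network loss (an assumption).
    return w

def sgld(w0, step=0.01, n_steps=20_000, burn_in=2_000):
    """Langevin dynamics: a gradient step plus Gaussian noise with variance
    2 * step, so that the iterates approximately sample exp(-loss)."""
    w, samples = w0.copy(), []
    for k in range(n_steps):
        w = w - step * grad_loss(w) + np.sqrt(2 * step) * rng.standard_normal(w.shape)
        if k >= burn_in:
            samples.append(w.copy())
    return np.array(samples)

samples = sgld(np.zeros(2))
# For this loss, exp(-loss) is a standard Gaussian, so the post-burn-in
# sample mean should be near 0 and the sample variance near 1.
```

Within the two-step scheme, such samples would feed the moment-matching step, replacing exact expectations under the shifted density by empirical averages.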

Background
Background on Local Entropy Regularization
Background on Heat Regularization
Notation
Local Entropy
Two-Step Iterative Optimization
Majorization–Minimization and Monotonicity
Heat Regularization
Gaussian Kullback–Leibler Minimization
Stochastic Gradient Langevin Dynamics
Importance Sampling
Numerical Experiments
Network Specification
Training Neural Networks from Random Initialization
Local Entropy Regularization after SGD
Algorithm Stability and Monotonicity
Choosing τ
Findings
Conclusions