Abstract

We study the dynamics of information processing in the continuous-depth limit of deep feed-forward Neural Networks (NN) and find that it can be described in a language similar to the Renormalization Group (RG). The association of concepts to patterns by a NN is analogous to the identification of the few variables that characterize the thermodynamic state obtained by the RG from microstates. We encode the information about the weights of a NN in a Maxent family of distributions, whose location hyper-parameters represent the weight estimates. Bayesian learning of new examples determines new constraints on the generators of the family, yielding a new pdf, and in the ensuing entropic dynamics of learning the hyper-parameters change along the gradient of the evidence. For a feed-forward architecture, the evidence can be written recursively from the evidence up to the previous layer, convoluted with an aggregation kernel. The continuum limit leads to a diffusion-like PDE analogous to Wilson’s RG, but with an aggregation kernel that depends on the weights of the NN, different from those that integrate out ultraviolet degrees of freedom. Approximations to the evidence can be obtained from solutions of the RG equation; its derivatives with respect to the hyper-parameters generate examples of Entropic Dynamics in Neural Network Architectures (EDNNA) learning algorithms. For simple architectures, these algorithms can be shown to yield optimal generalization in student-teacher scenarios.
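The entropic dynamics described above moves the location hyper-parameters along the gradient of the log-evidence of each new example. As a hedged illustration (a toy of our own construction, not the paper's EDNNA algorithm), consider a Gaussian weight family with location hyper-parameter `theta` and a linear model with Gaussian likelihood: the log-evidence of an example is then quadratic in the prediction error, and the gradient step reduces to a familiar error-correction rule. The teacher weights, variance `sigma2`, and learning rate `eta` are all assumed values for the sketch.

```python
import numpy as np

# Toy sketch (assumed setup, not the paper's algorithm): theta is the
# location hyper-parameter of a Gaussian weight distribution; each new
# example (x, y) shifts theta along the gradient of its log-evidence.
rng = np.random.default_rng(1)
theta = np.zeros(2)          # location hyper-parameter (weight estimate)
sigma2 = 1.0                 # assumed variance of the Gaussian evidence
eta = 0.5                    # step size along the evidence gradient

teacher = np.array([1.0, -2.0])   # weights of a noiseless teacher
for _ in range(200):
    x = rng.normal(size=2)
    y = teacher @ x
    # log-evidence of (x, y): -(y - theta @ x)**2 / (2 * sigma2) + const,
    # so its gradient with respect to theta is an error-correction term.
    grad = (y - theta @ x) * x / sigma2
    theta += eta * grad

print(theta)  # approaches the teacher weights in this student-teacher toy
```

In this simple student-teacher scenario the evidence-gradient update is just the delta rule, which is consistent with the claim that such algorithms achieve good generalization in simple architectures.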

Highlights

  • Neural networks are information processing systems that learn from examples

  • Neural networks are parametric models; if the determination of the architecture is not addressed (as in this paper), the problem of learning from examples reduces to obtaining fast estimates of the weights or parameters, avoiding integration over high-dimensional spaces

  • For on-line learning, theoretical analysis is easier than for batch or off-line learning, where the cost function depends on a large number of example pairs, while on-line accuracy performance remains high


Introduction

Neural networks are information processing systems that learn from examples. Loosely inspired by biological neural systems, they have been used for several types of problems, such as classification, regression, dimensional reduction and clustering [1]. For on-line learning, theoretical analysis is easier than for batch or off-line learning, where the cost function depends on a large number of example pairs, while on-line accuracy performance remains high. The denominator of the Bayes update can be interpreted either as the evidence of the model or, alternatively, as the predictive probability distribution of the output conditioned on the input and the weights. Once it is written as the marginalization over the internal representation, i.e. the activation values of the internal units, of the joint distribution of activities of the whole network, and under the supposition that information flows only from one layer to the next, a Markov chain structure follows. The first authors to relate the RG and NN were [3] and [4], generating a large flow of ideas into the possible connections between these two areas [5,6,7].
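The Markov chain structure over layer activations means the evidence can be computed recursively: the predictive distribution at layer l is the distribution at layer l-1 marginalized through that layer's aggregation kernel. A minimal sketch of this recursion, under our own simplifying assumptions (activations discretized to a finite set of states, each kernel represented as a random stochastic matrix rather than one derived from actual NN weights):

```python
import numpy as np

# Hedged sketch, not the paper's construction: for a feed-forward Markov
# chain over layer activations s_0 -> s_1 -> ... -> s_L, the evidence is
#   P(s_L | s_0) = sum over s_1..s_{L-1} of K_L(...) ... K_1(s_1 | s_0),
# i.e. a repeated convolution with per-layer aggregation kernels. With
# discretized activations each kernel is a stochastic matrix and the
# marginalization becomes a chain of matrix-vector products.
rng = np.random.default_rng(0)

def random_kernel(n_out, n_in, rng):
    """Column-stochastic matrix: K[j, i] = P(s_out = j | s_in = i)."""
    K = rng.random((n_out, n_in))
    return K / K.sum(axis=0, keepdims=True)

n_states = 5
layers = [random_kernel(n_states, n_states, rng) for _ in range(3)]

# Distribution of the (discretized) input activation: a definite state.
p = np.zeros(n_states)
p[2] = 1.0

# Propagate layer by layer; each product marginalizes out one layer.
for K in layers:
    p = K @ p

print(p)  # predictive distribution over output states, still normalized
```

Refining the discretization and taking more, weaker layers is the direction in which the paper's continuum limit turns this recursion into a diffusion-like PDE.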

