Abstract

Inspired by the adaptation phenomenon of neuronal firing, we propose regularity normalization (RN) as an unsupervised attention mechanism (UAM) that computes the statistical regularity in the implicit space of neural networks under the Minimum Description Length (MDL) principle. Treating the neural network optimization process as a partially observable model selection problem, regularity normalization constrains the implicit space by a normalization factor, the universal code length. We compute this universal code incrementally across neural network layers and demonstrate the flexibility to include data priors such as top-down attention and other oracle information. Empirically, our approach outperforms existing normalization methods in tackling limited, imbalanced and non-stationary input distributions in image classification, classic control, procedurally-generated reinforcement learning, generative modeling, handwriting generation and question answering tasks with various neural network architectures. Lastly, the unsupervised attention mechanism is a useful probing tool for neural networks, tracking the dependency and critical learning stages across layers and recurrent time steps of deep networks.
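
To make the mechanism concrete, below is a minimal NumPy sketch of the idea: each layer keeps a running maximum-likelihood estimate of its activation statistics, scores each incoming activation by its universal code length, and uses that code length as an attention gain. The class name `RegularityNorm`, the per-neuron Gaussian model class, and the exact update rules are illustrative assumptions for this sketch, not the published implementation.

```python
import numpy as np

class RegularityNorm:
    """Minimal sketch of regularity normalization for one layer.

    Assumes a Gaussian model class per neuron with running (incremental)
    maximum-likelihood estimates; the scaling form is a simplification.
    """

    def __init__(self, dim, eps=1e-8):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.comp = np.full(dim, eps)  # running normalizer over seen data
        self.count = 0
        self.eps = eps

    def __call__(self, x):
        # Incrementally update the per-neuron Gaussian MLE (Welford update).
        self.count += 1
        delta = x - self.mean
        self.mean = self.mean + delta / self.count
        self.var = self.var + (delta * (x - self.mean) - self.var) / self.count
        sigma = np.sqrt(np.maximum(self.var, 0.0)) + self.eps
        # Likelihood of the current activation under the running model.
        p = np.exp(-0.5 * ((x - self.mean) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        # Universal (NML-style) probability: normalize by the accumulated
        # likelihoods of the activations seen so far.
        self.comp = self.comp + p
        code_length = -np.log(p / self.comp + self.eps)
        # Scale each activation by its code length: less regular
        # (more surprising) activations receive a larger gain.
        return code_length * x
```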

Highlights

  • The Minimum Description Length (MDL) principle asserts that the best model given some data is the one that minimizes the combined cost of describing the model and describing the misfit between the model and the data [1], with the goal of maximizing regularity extraction for optimal data compression, prediction and communication [2].

  • If we consider the activations from each layer of a neural network as population codes, the constraint space can be subdivided into the input-vector space, the hidden-vector space, and the implicit space, which represents the underlying dimensions of variability in the other two spaces, i.e., a reduced representation of the constraint space.

  • LN+RN, a combined approach in which the regularity normalization is applied after the layer normalization (see the sketch after this list).
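
A hypothetical composition of the two, reusing the `RegularityNorm` sketch from the abstract above; the LN-then-RN ordering follows the highlight, while the helper names are ours:

```python
import numpy as np

def layer_norm(x, eps=1e-8):
    # Standard layer normalization over the feature dimension.
    return (x - x.mean()) / (x.std() + eps)

# LN+RN: layer-normalize first, then apply regularity normalization.
rn = RegularityNorm(dim=64)

def ln_rn(x):
    return rn(layer_norm(x))

out = ln_rn(np.random.randn(64))
```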


Summary

Introduction

The Minimum Description Length (MDL) principle asserts that the best model given some data is the one that minimizes the combined cost of describing the model and describing the misfit between the model and the data [1], with the goal of maximizing regularity extraction for optimal data compression, prediction and communication [2]. If we consider neural network training as the optimization process of a communication system, each input at each layer of the system can be described as a point in a low-dimensional continuous constraint space [4]. The minimum code length given any arbitrary $\theta$ would be $L(x \mid \hat{\theta}(x)) = -\log P(x \mid \hat{\theta}(x))$, with the model $\hat{\theta}(x)$ that compresses the data sample $x$ most efficiently and offers the maximum likelihood $P(x \mid \hat{\theta}(x))$ [2]. The compressibility of the model, computed as the minimum code length, can be unattainable when multiple non-i.i.d. data samples arrive as individual inputs, because the distribution that most efficiently represents a given sample $x$ under a model class varies from sample to sample. The solution relies on the existence of a universal code, $P(x)$, defined for a model class $\Theta$ such that, for any data sample $x$, the shortest code for $x$ is always $L(x \mid \hat{\theta}(x))$, as shown in [27].
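
The standard construction of such a universal code is the normalized maximum likelihood (NML); writing it out makes the normalization factor explicit:

$$
P_{\mathrm{NML}}(x) = \frac{P(x \mid \hat{\theta}(x))}{\sum_{x'} P(x' \mid \hat{\theta}(x'))},
\qquad
L_{\mathrm{NML}}(x) = -\log P(x \mid \hat{\theta}(x)) + \log \sum_{x'} P(x' \mid \hat{\theta}(x')).
$$

The second term, the log-sum of maximum likelihoods over the sample space, does not depend on $x$; it is the normalization factor that regularity normalization estimates incrementally across training steps, so the code for every sample stays within this constant of the otherwise unattainable per-sample optimum $L(x \mid \hat{\theta}(x))$.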

