Abstract

As deep neural networks grow in size, from thousands to millions to billions of weights, the performance of those networks becomes limited by our ability to accurately train them. A common naive question arises: if we have a system with billions of degrees of freedom, don't we also need billions of samples to train it? Of course, the success of deep learning indicates that reliable models can be learned with reasonable amounts of data. Similar questions arise in protein folding, spin glasses, and biological neural networks. With effectively infinite potential folding/spin/wiring configurations, how does the system find the precise arrangement that leads to useful and robust results? Simply sampling possible configurations until an optimal one is reached is not a viable option, even if one waited for the age of the universe. Instead, there appears to be a mechanism in the above phenomena that forces them to achieve configurations that live on a low-dimensional manifold, avoiding the curse of dimensionality. In the current work we use the concept of mutual information between successive layers of a deep neural network to elucidate this mechanism and suggest possible ways of exploiting it to accelerate training. We show that adding structure to the neural network leads to higher mutual information between layers. High mutual information between layers implies that the effective number of free parameters is exponentially smaller than the raw number of tunable weights, providing insight into why neural networks with far more weights than training points can be reliably trained.
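To make the central measurement concrete, the sketch below estimates the mutual information between activations in successive layers of a small, randomly initialized multilayer perceptron. It is an illustrative reconstruction, not the authors' code: the layer sizes, the tanh nonlinearity, the sample count, and the simple rank-binned plug-in estimator are all assumptions chosen for brevity.

import numpy as np

rng = np.random.default_rng(0)

def quantile_bin(x, n_bins=16):
    # Rank-based (equal-count) binning: unchanged by any monotone invertible
    # transform of x, so the MI estimate inherits that invariance.
    ranks = np.argsort(np.argsort(x))
    return (ranks * n_bins) // len(x)

def mutual_information(x, y, n_bins=16):
    # Plug-in estimate (in nats) from the joint histogram of the binned values.
    joint = np.zeros((n_bins, n_bins))
    bx, by = quantile_bin(x, n_bins), quantile_bin(y, n_bins)
    np.add.at(joint, (bx, by), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal over x-bins
    py = joint.sum(axis=0, keepdims=True)   # marginal over y-bins
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

# Toy two-layer network, 8 -> 6 -> 4 units with tanh activations (hypothetical sizes).
n_samples = 5000
x = rng.normal(size=(n_samples, 8))
h1 = np.tanh(x @ rng.normal(size=(8, 6)))
h2 = np.tanh(h1 @ rng.normal(size=(6, 4)))

# Mutual information between one unit in layer 1 and one unit in layer 2;
# in the paper's framing this quantity would be tracked across layers and during training.
print(mutual_information(h1[:, 0], h2[:, 0]))

The rank-based binning is a deliberate choice here: because it depends only on the ordering of the samples, the estimate is unchanged by monotone invertible transforms of either variable, in the spirit of the invariance noted in the highlights.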

Highlights

  • Artificial neural networks with millions, or even billions (Shazeer et al., 2017), of weights now provide numbers of neurons and synapses, and hence a computational complexity, approaching those of small animals (Goodfellow et al., 2016)

  • Multilayer perceptrons (MLPs) with shortcuts start with higher mutual information, which decreases as training proceeds toward the optimum

  • Although we say “correlation,” we precisely measure this redundancy using mutual information, which is invariant under arbitrary invertible nonlinearities (see the sketch after this list)
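
The invariance claim in the last bullet is a standard property of mutual information rather than something derived in this summary. The short change-of-variables sketch below (assuming scalar variables and smooth, strictly monotone maps $f$ and $g$) shows why reparameterizing either variable leaves the mutual information unchanged.

% Let U = f(X) and V = g(Y) with f, g smooth and invertible.
\[
p_{U,V}(u,v) = \frac{p_{X,Y}(x,y)}{|f'(x)|\,|g'(y)|}, \qquad
p_U(u) = \frac{p_X(x)}{|f'(x)|}, \qquad
p_V(v) = \frac{p_Y(y)}{|g'(y)|},
\]
\[
I(U;V) = \int p_{U,V}(u,v)\,\log\frac{p_{U,V}(u,v)}{p_U(u)\,p_V(v)}\,\mathrm{d}u\,\mathrm{d}v
       = \int p_{X,Y}(x,y)\,\log\frac{p_{X,Y}(x,y)}{p_X(x)\,p_Y(y)}\,\mathrm{d}x\,\mathrm{d}y
       = I(X;Y),
\]
% since the Jacobian factors cancel inside the logarithm and
% \mathrm{d}u\,\mathrm{d}v = |f'(x)|\,|g'(y)|\,\mathrm{d}x\,\mathrm{d}y.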

Summary

Introduction

Artificial neural networks with millions, or even billions (Shazeer et al., 2017), of weights now provide numbers of neurons and synapses, and hence a computational complexity, approaching those of small animals (Goodfellow et al., 2016). Scientists have begun using them to test and compare many hypotheses in cognitive science (Phillips and Hodas, 2017). Some work has begun to explore how these complex systems reach such finely balanced solutions. Some have addressed how, given that the space of possible functions is so large, any finite computational stage can do a good job of approximating physical systems (Lin et al., 2017). From a cognitive science perspective, the converse question remains: how is it that these complex systems can be trained with only a reasonable amount of data, vastly less than their complexity would suggest? Given the computational power available in modern GPUs, we may explore these artificial neural networks to better understand how such highly interconnected computational graphs transfer information to quickly reach global optima.
