Abstract

As deep neural networks grow in size, from thousands to millions to billions of weights, the performance of those networks becomes limited by our ability to accurately train them. A common naive question arises: if we have a system with billions of degrees of freedom, don't we also need billions of samples to train it? Of course, the success of deep learning indicates that reliable models can be learned with reasonable amounts of data. Similar questions arise in protein folding, spin glasses, and biological neural networks. With effectively infinite potential folding/spin/wiring configurations, how does the system find the precise arrangement that leads to useful and robust results? Simply sampling possible configurations until an optimal one is reached is not a viable option, even if one waited for the age of the universe. Instead, there appears to be a mechanism in the above phenomena that forces them to achieve configurations that live on a low-dimensional manifold, avoiding the curse of dimensionality. In the current work we use the concept of mutual information between successive layers of a deep neural network to elucidate this mechanism and suggest possible ways of exploiting it to accelerate training. We show that adding structure to the neural network leads to higher mutual information between layers. High mutual information between layers implies that the effective number of free parameters is exponentially smaller than the raw number of tunable weights, providing insight into why neural networks with far more weights than training points can be reliably trained.
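To make the central measurement concrete, the sketch below estimates the mutual information between activations in successive layers of a small, randomly initialized multilayer perceptron. It is an illustrative reconstruction, not the authors' code: the layer sizes, the tanh nonlinearity, the sample count, and the simple rank-binned plug-in estimator are all assumptions chosen for brevity.

import numpy as np

rng = np.random.default_rng(0)

def quantile_bin(x, n_bins=16):
    # Rank-based (equal-count) binning: unchanged by any monotone invertible
    # transform of x, so the MI estimate inherits that invariance.
    ranks = np.argsort(np.argsort(x))
    return (ranks * n_bins) // len(x)

def mutual_information(x, y, n_bins=16):
    # Plug-in estimate (in nats) from the joint histogram of the binned values.
    joint = np.zeros((n_bins, n_bins))
    bx, by = quantile_bin(x, n_bins), quantile_bin(y, n_bins)
    np.add.at(joint, (bx, by), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal over x-bins
    py = joint.sum(axis=0, keepdims=True)   # marginal over y-bins
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

# Toy two-layer network, 8 -> 6 -> 4 units with tanh activations (hypothetical sizes).
n_samples = 5000
x = rng.normal(size=(n_samples, 8))
h1 = np.tanh(x @ rng.normal(size=(8, 6)))
h2 = np.tanh(h1 @ rng.normal(size=(6, 4)))

# Mutual information between one unit in layer 1 and one unit in layer 2;
# in the paper's framing this quantity would be tracked across layers and during training.
print(mutual_information(h1[:, 0], h2[:, 0]))

The rank-based binning is a deliberate choice here: because it depends only on the ordering of the samples, the estimate is unchanged by monotone invertible transforms of either variable, in the spirit of the invariance noted in the highlights.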

Highlights

  • Artificial neural networks with millions, or even billions (Shazeer et al., 2017), of weights now provide numbers of neurons and synapses, and hence a computational complexity, approaching those of small animals (Goodfellow et al., 2016)

  • Multilayer perceptrons (MLPs) with shortcuts start with higher mutual information, which decreases as training proceeds toward the optimum

  • Although we say “correlation,” we precisely measure this redundancy using mutual information, which is invariant under arbitrary invertible nonlinearities (see the sketch after this list)
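
The invariance claim in the last bullet is a standard property of mutual information rather than something derived in this summary. The short change-of-variables sketch below (assuming scalar variables and smooth, strictly monotone maps $f$ and $g$) shows why reparameterizing either variable leaves the mutual information unchanged.

% Let U = f(X) and V = g(Y) with f, g smooth and invertible.
\[
p_{U,V}(u,v) = \frac{p_{X,Y}(x,y)}{|f'(x)|\,|g'(y)|}, \qquad
p_U(u) = \frac{p_X(x)}{|f'(x)|}, \qquad
p_V(v) = \frac{p_Y(y)}{|g'(y)|},
\]
\[
I(U;V) = \int p_{U,V}(u,v)\,\log\frac{p_{U,V}(u,v)}{p_U(u)\,p_V(v)}\,\mathrm{d}u\,\mathrm{d}v
       = \int p_{X,Y}(x,y)\,\log\frac{p_{X,Y}(x,y)}{p_X(x)\,p_Y(y)}\,\mathrm{d}x\,\mathrm{d}y
       = I(X;Y),
\]
% since the Jacobian factors cancel inside the logarithm and
% \mathrm{d}u\,\mathrm{d}v = |f'(x)|\,|g'(y)|\,\mathrm{d}x\,\mathrm{d}y.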

Summary

Introduction

Artificial neural networks with millions, or even billions (Shazeer et al., 2017), of weights now provide numbers of neurons and synapses, and hence a computational complexity, approaching those of small animals (Goodfellow et al., 2016). Scientists have begun using them to test and compare many hypotheses in cognitive science (Phillips and Hodas, 2017). Some work has begun to explore how these complex systems reach such finely balanced solutions. Some have addressed how, given that the space of possible functions is so large, any finite computational stage can do a good job of approximating physical systems (Lin et al., 2017). From a cognitive science perspective, the converse question remains: how is it that these complex systems can be trained with only a reasonable amount of data, vastly less than their complexity would suggest? Given the computational power available in modern GPUs, we may explore these artificial neural networks to better understand how such highly interconnected computational graphs transfer information to quickly reach global optima.
