Abstract

The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting, that is, accurate predictions despite overfitting training data. In this article, we survey recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behaviour of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favourable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.
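
As a concrete illustration of the implicit regularization mentioned above, the following sketch (ours, not the article's; the dimensions, step size and iteration count are illustrative assumptions) runs gradient descent from zero on an overparametrized least-squares problem and checks numerically that it converges to the minimum-norm interpolating solution.

```python
# Minimal sketch (illustrative assumptions, not the article's code): gradient
# descent on overparametrized least squares, started from zero, converges to
# the minimum-l2-norm solution that interpolates the training data.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                       # n samples, d parameters, d >> n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)           # arbitrary labels; interpolants exist since d > n

w = np.zeros(d)                      # zero initialization keeps w in the row space of X
step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / largest eigenvalue of X^T X
for _ in range(2000):
    w -= step * X.T @ (X @ w - y)    # gradient of 0.5 * ||X w - y||^2

w_min_norm = np.linalg.pinv(X) @ y   # minimum-norm interpolant X^T (X X^T)^{-1} y

print("training residual   :", np.linalg.norm(X @ w - y))        # ~ 0 (perfect fit)
print("gap to min-norm sol.:", np.linalg.norm(w - w_min_norm))   # ~ 0
```

The zero initialization matters: gradient descent never leaves the row space of X, so among the many interpolating solutions it can only reach the one of minimum norm.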

Highlights

  • The past decade has witnessed dramatic advances in machine learning that have led to major breakthroughs in computer vision, speech recognition, and robotics

  • We have considered the statistical performance of the empirical risk minimizer f_erm without considering the computational cost of solving this optimization problem

  • It is instructive to consider the implications of the generalization bounds we have reviewed for the phenomenon of benign overfitting, which has been observed in deep learning

Summary

Introduction

The past decade has witnessed dramatic advances in machine learning that have led to major breakthroughs in computer vision, speech recognition, and robotics. Deep learning presents some major surprises from a theoretical perspective: deep learning methods can find near-optimal solutions to highly non-convex empirical risk minimization problems, solutions that give a near-perfect fit to noisy training data, and yet, despite making no explicit effort to control model complexity, these methods deliver excellent prediction performance in practice. The first surprise concerns optimization: deep learning exploits rich and expressive models with many parameters, and the problem of optimizing the fit to the training data appears to simplify dramatically when the function class is rich enough, that is, when it is sufficiently overparametrized. The second surprising empirical discovery is that these models generalize well even though they lie outside the realm of uniform convergence: they are enormously complex, with many parameters; they are trained with no explicit regularization to control their statistical complexity; and they typically exhibit a near-perfect fit to noisy training data, that is, an empirical risk close to zero. It seems likely that depth is crucial to this expressivity.
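
To make the "near-perfect fit to noisy data, yet good predictions" phenomenon concrete in the simplest possible setting, the following sketch (ours; the sample size, dimension, covariance and noise level are illustrative assumptions) fits noisy labels exactly with the minimum-norm linear interpolant and checks that its excess prediction risk nevertheless stays small.

```python
# Minimal sketch (illustrative assumptions, not the article's experiments):
# benign overfitting of the minimum-norm interpolant in overparametrized
# linear regression with a few strong directions and a long flat tail.
import numpy as np

rng = np.random.default_rng(0)
n, d, k, sigma = 200, 2000, 5, 0.5

# Covariance: k strong directions carry the signal; the remaining d-k directions
# form a flat, low-energy tail whose total energy is 1.
eigs = np.concatenate([np.ones(k), np.full(d - k, 1.0 / (d - k))])
theta_star = np.zeros(d)
theta_star[:k] = 1.0 / np.sqrt(k)                      # unit-norm signal on the strong directions

X = rng.standard_normal((n, d)) * np.sqrt(eigs)        # rows ~ N(0, diag(eigs))
y = X @ theta_star + sigma * rng.standard_normal(n)    # noisy labels

theta_hat = np.linalg.pinv(X) @ y                      # minimum-norm interpolant

train_mse = np.mean((X @ theta_hat - y) ** 2)
excess_risk = (theta_hat - theta_star) @ (eigs * (theta_hat - theta_star))

print(f"train MSE        : {train_mse:.2e}")    # ~ 0: the noisy labels are fit exactly
print(f"excess test risk : {excess_risk:.3f}")  # small relative to the signal energy of 1
```

The favourable structure here, a few strong covariate directions carrying the signal plus a long flat tail of weak directions that absorbs the noise, mirrors the decomposition into a simple component useful for prediction and a spiky component useful for overfitting that is discussed in the abstract.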

Overview
Generalization and uniform convergence
Preliminaries
Uniform laws of large numbers
Faster rates
Complexity regularization
Computational complexity of empirical risk minimization
Classification
Large margin classification
Real prediction
The mismatch between benign overfitting and uniform convergence
Benign overfitting
Local methods
Linear regression in the interpolating regime
Linear regression in Reproducing Kernel Hilbert Spaces
The Laplace kernel with constant dimension
Kernels on R^d with d ≍ n^α
Kernels on R^d with d ≍ n
Efficient optimization
The linear regime
Beyond the linear regime?
Other approaches
Generalization in the linear regime
The implicit regularization of gradient-based training
Ridge regression in the linear regime
Random features model
Polynomial scaling
Proportional scaling
Predicted test error
Neural tangent model
Conclusions and future directions
Bound on the variance of the minimum-norm interpolant
Exact characterization in the proportional asymptotics
An estimate on the entries of the resolvent
Consequences