Abstract
The bias-variance trade-off is a central concept in supervised learning. In classical statistics, increasing the complexity of a model (e.g., number of parameters) reduces bias but also increases variance. Until recently, it was commonly believed that optimal performance is achieved at intermediate model complexities which strike a balance between bias and variance. Modern Deep Learning methods flout this dogma, achieving state-of-the-art performance using "over-parameterized models" where the number of fit parameters is large enough to perfectly fit the training data. As a result, understanding bias and variance in over-parameterized models has emerged as a fundamental problem in machine learning. Here, we use methods from statistical physics to derive analytic expressions for bias and variance in two minimal models of over-parameterization (linear regression and two-layer neural networks with nonlinear data distributions), allowing us to disentangle properties stemming from the model architecture and random sampling of data. In both models, increasing the number of fit parameters leads to a phase transition where the training error goes to zero and the test error diverges as a result of the variance (while the bias remains finite). Beyond this threshold, the test error of the two-layer neural network decreases due to a monotonic decrease in both the bias and variance in contrast with the classical bias-variance trade-off. We also show that in contrast with classical intuition, over-parameterized models can overfit even in the absence of noise and exhibit bias even if the student and teacher models match. We synthesize these results to construct a holistic understanding of generalization error and the bias-variance trade-off in over-parameterized models and relate our results to random matrix theory.
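The double-descent behavior described in the abstract can be reproduced numerically. The sketch below is a minimal illustration, not the paper's exact teacher-student models: the tanh feature map, sample sizes, and noise level are arbitrary choices. It fits minimum-norm least squares on random nonlinear features and estimates bias and variance by averaging predictions over independently drawn training sets; the variance is expected to peak near the interpolation threshold, where the number of fit parameters equals the number of training samples, and to decrease again in the over-parameterized regime.

```python
# Minimal sketch (not the paper's exact setup): double descent and a
# bias/variance estimate for minimum-norm least squares on random features.
import numpy as np

rng = np.random.default_rng(0)

n_train, n_test, d = 50, 500, 20      # training samples, test points, input dimension
w_teacher = rng.normal(size=d)        # linear "teacher" generating the labels
noise = 0.1                           # label noise strength

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ w_teacher + noise * rng.normal(size=n)
    return X, y

def random_features(X, W):
    return np.tanh(X @ W)             # fixed random nonlinear feature map

X_test, y_test = make_data(n_test)
y_clean = X_test @ w_teacher          # noiseless targets for the bias estimate

for n_features in [5, 25, 45, 50, 55, 100, 400]:
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)   # shared feature map
    Phi_test = random_features(X_test, W)

    # Average predictions over many independent training sets to
    # estimate bias^2 and variance at each test point.
    preds = []
    for _ in range(100):
        X_tr, y_tr = make_data(n_train)
        Phi_tr = random_features(X_tr, W)
        # Minimum-norm least-squares fit; the pseudoinverse handles the
        # over-parameterized case n_features > n_train.
        theta = np.linalg.pinv(Phi_tr) @ y_tr
        preds.append(Phi_test @ theta)
    preds = np.array(preds)

    bias2 = np.mean((preds.mean(axis=0) - y_clean) ** 2)
    var = np.mean(preds.var(axis=0))
    test_err = np.mean((preds - y_test) ** 2)
    print(f"p={n_features:4d}  bias^2={bias2:7.3f}  var={var:7.3f}  test={test_err:7.3f}")
```

Running the sketch, the training error reaches zero once the number of features exceeds the number of training samples, while the test error spikes near p = n_train, driven by the variance term, mirroring the phase transition described above.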
Highlights
Machine learning (ML) is one of the most exciting and fastest-growing areas of modern research and application.
These studies have found that in the absence of regularization, the bias reaches a minimum at the interpolation threshold and remains constant into the overparameterized regime. Of these studies, closed-form expressions using the standard definitions of bias and variance have only been obtained for the simple case of linear regression without basis functions and a linear data distribution [42]. While this setting captures some qualitative aspects of the double-descent phenomenon [see Fig. 1(a)], it requires a one-to-one correspondence between features in the data and fit parameters and a perfect match between the data distribution and model architecture, making it difficult to understand if and how these results generalize to more complicated statistical models.
We find that the training error, test error, bias, and variance take general forms that we derive analytically.
Summary
Machine learning (ML) is one of the most exciting and fastest-growing areas of modern research and application. The test error of a statistical model can be decomposed into three sources: bias (errors resulting from erroneous assumptions which can hamper a statistical model's ability to fully express the patterns hidden in the data), variance (errors arising from oversensitivity to the particular choice of training set), and noise (errors that are irreducible because they are inherent to the data). This bias-variance decomposition provides a natural intuition for understanding how complex a model must be in order to make accurate predictions on unseen data. Recent work on deep learning has found, however, that as model complexity grows, the test error first decreases, then increases, peaking at the interpolation threshold where the model becomes just complex enough to fit the training data perfectly. If model complexity is increased past the interpolation threshold, the test error once again decreases, often resulting in overparameterized models with even better out-of-sample performance than their underparameterized counterparts [see Fig. 1(b)]. This double-descent behavior stands in stark contrast with the classical statistical intuition based on the bias-variance trade-off; both bias and variance appear to decrease past the interpolation threshold. Explaining the unexpected success of overparameterized models represents a fundamental problem in ML and modern statistics.
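For reference, the standard three-source decomposition reads as follows (written in generic notation that may differ in detail from the paper's own definitions). Here $f_D$ is the model fit on a training set $D$, $\bar{f}(x) = \mathbb{E}_D[f_D(x)]$ is the prediction averaged over training sets, labels are generated as $y = \bar{y}(x) + \varepsilon$, and $\sigma^2$ is the variance of the label noise $\varepsilon$:

\[
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - f_D(x)\big)^2\right]
= \underbrace{\big(\bar{y}(x) - \bar{f}(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\!\left[\big(f_D(x) - \bar{f}(x)\big)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
\]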