Abstract

The bias-variance trade-off is a central concept in supervised learning. In classical statistics, increasing the complexity of a model (e.g., number of parameters) reduces bias but also increases variance. Until recently, it was commonly believed that optimal performance is achieved at intermediate model complexities which strike a balance between bias and variance. Modern deep learning methods flout this dogma, achieving state-of-the-art performance using "over-parameterized models" where the number of fit parameters is large enough to perfectly fit the training data. As a result, understanding bias and variance in over-parameterized models has emerged as a fundamental problem in machine learning. Here, we use methods from statistical physics to derive analytic expressions for bias and variance in two minimal models of over-parameterization (linear regression and two-layer neural networks with nonlinear data distributions), allowing us to disentangle properties stemming from the model architecture and random sampling of data. In both models, increasing the number of fit parameters leads to a phase transition where the training error goes to zero and the test error diverges as a result of the variance (while the bias remains finite). Beyond this threshold, the test error of the two-layer neural network decreases due to a monotonic decrease in both the bias and variance in contrast with the classical bias-variance trade-off. We also show that in contrast with classical intuition, over-parameterized models can overfit even in the absence of noise and exhibit bias even if the student and teacher models match. We synthesize these results to construct a holistic understanding of generalization error and the bias-variance trade-off in over-parameterized models and relate our results to random matrix theory.
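
To make the double-descent picture described above concrete, the following minimal numerical sketch sweeps the number of fit parameters of a random-features student through the interpolation threshold. This is an illustrative toy setup, not the paper's exact construction: the tanh teacher, ReLU random features, and all sizes are assumptions chosen for demonstration. The training error drops to zero at the threshold (p = n_train) while the test error spikes there and then descends again.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 50

# Nonlinear "teacher" that generates the data (illustrative choice)
w_star = rng.standard_normal(d) / np.sqrt(d)
teacher = lambda X: np.tanh(X @ w_star)

X_train = rng.standard_normal((n_train, d))
X_test = rng.standard_normal((n_test, d))
y_train, y_test = teacher(X_train), teacher(X_test)

for p in [10, 50, 90, 100, 110, 200, 1000]:
    # "Student": two-layer network with a frozen random first layer (p hidden features)
    W = rng.standard_normal((d, p)) / np.sqrt(d)
    Z_train = np.maximum(X_train @ W, 0.0)   # ReLU random features
    Z_test = np.maximum(X_test @ W, 0.0)
    # Minimum-norm least-squares fit of the output layer
    a = np.linalg.pinv(Z_train) @ y_train
    train_err = np.mean((Z_train @ a - y_train) ** 2)
    test_err = np.mean((Z_test @ a - y_test) ** 2)
    print(f"p = {p:4d}   train MSE = {train_err:.2e}   test MSE = {test_err:.3f}")
```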

Highlights

  • Machine learning (ML) is one of the most exciting and fastest-growing areas of modern research and application.

  • Previous studies have found that in the absence of regularization, the bias reaches a minimum at the interpolation threshold and remains constant into the overparameterized regime. Of these studies, closed-form expressions using the standard definitions of bias and variance have only been obtained for the simple case of linear regression without basis functions and a linear data distribution [42]. While this setting captures some qualitative aspects of the double-descent phenomenon [see Fig. 1(a)], it requires a one-to-one correspondence between features in the data and fit parameters and a perfect match between the data distribution and model architecture, making it difficult to understand if and how these results generalize to more complicated statistical models (a numerical illustration of this linear-regression setting follows this list).

  • We find that the training error, test error, bias, and variance take the general forms
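
As referenced above, the sketch below illustrates the standard definitions of bias and variance in the linear-regression setting (no basis functions, linear data distribution): it fits the model on many independent draws of the training set and estimates bias squared and variance pointwise on held-out inputs. The sizes and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_train, n_test, sigma = 20, 40, 2000, 0.1

# Linear teacher; the student is plain linear regression (no basis functions),
# so features and fit parameters are in one-to-one correspondence.
w_star = rng.standard_normal(d) / np.sqrt(d)
X_test = rng.standard_normal((n_test, d))
y_test_clean = X_test @ w_star

# Standard definitions: average the fitted predictor over many independent
# draws of the training set, then read off bias^2 and variance pointwise.
n_draws = 500
preds = np.empty((n_draws, n_test))
for t in range(n_draws):
    X = rng.standard_normal((n_train, d))
    y = X @ w_star + sigma * rng.standard_normal(n_train)
    w_hat = np.linalg.pinv(X) @ y            # least-squares solution
    preds[t] = X_test @ w_hat

bias2 = np.mean((preds.mean(axis=0) - y_test_clean) ** 2)
variance = np.mean(preds.var(axis=0))
print(f"bias^2 ~ {bias2:.4f}   variance ~ {variance:.4f}")
# With n_train > d and a correctly specified linear model, least squares is
# unbiased, so bias^2 is ~0 here and the test error is dominated by variance.
```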

Summary

INTRODUCTION

Machine learning (ML) is one of the most exciting and fastest-growing areas of modern research and application. Classically, the test (generalization) error of a statistical model is decomposed into three sources: bias (errors resulting from erroneous assumptions which can hamper a statistical model's ability to fully express the patterns hidden in the data), variance (errors arising from oversensitivity to the particular choice of training set), and noise (errors stemming from inherent randomness in the data). This bias-variance decomposition (written explicitly below) provides a natural intuition for understanding how complex a model must be in order to make accurate predictions on unseen data. In modern ML practice, however, the test error does not behave as this intuition suggests: as model complexity approaches the interpolation threshold, where the model has just enough fit parameters to perfectly fit the training data, the test error grows sharply. If model complexity is increased past the interpolation threshold, the test error once again decreases, often resulting in overparameterized models with even better out-of-sample performance than their underparameterized counterparts [see Fig. 1(b)]. This double-descent behavior stands in stark contrast with the classical statistical intuition based on the bias-variance trade-off; both bias and variance appear to decrease past the interpolation threshold. Explaining the unexpected success of overparameterized models represents a fundamental problem in ML and modern statistics.
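
For reference, the decomposition in question is the standard one for squared error; the notation here is generic (not necessarily the paper's symbols):

```latex
% Standard bias-variance decomposition of the squared test error at a point x,
% for data y = f(x) + \varepsilon with \mathbb{E}[\varepsilon] = 0,
% \mathrm{Var}(\varepsilon) = \sigma^2, and a model \hat{f}_{\mathcal{D}}
% fit on a randomly drawn training set \mathcal{D}:
\mathbb{E}_{\mathcal{D},\varepsilon}\!\left[\bigl(y - \hat{f}_{\mathcal{D}}(x)\bigr)^2\right]
  = \underbrace{\bigl(f(x) - \mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(x)]\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[\bigl(\hat{f}_{\mathcal{D}}(x)
      - \mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(x)]\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```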

Relation to previous work
Overview of approach
Summary of major results
Organization of paper
Supervised learning task
Data distribution (teacher model)
Model architectures (student models)
Linear regression
Random nonlinear features model
Fitting procedure
Model evaluation
Exact solutions
Hessian matrix
Derivation of closed-form solutions
BIAS-VARIANCE DECOMPOSITION
RESULTS
General solutions
UNDERSTANDING BIAS AND VARIANCE IN OVERPARAMETERIZED MODELS
Two sources of bias: imperfect models and incomplete exploration of features
Variance: overfitting stems from poorly sampled directions in the space of features
Biased models can interpret signal as noise
Interpolating is not the same as overfitting
Susceptibilities measure sensitivity to perturbations
Nonstandard bias-variance decompositions lead to incorrect interpretations of double descent
CONCLUSIONS
