Abstract
Supervised deep learning involves training neural networks with a large number N of parameters. For large enough N, in the so-called over-parametrized regime, one can essentially fit the training data points. Sparsity-based arguments would suggest that the generalization error increases as N grows past a certain threshold N*. Instead, empirical studies have shown that in the over-parametrized regime, the generalization error keeps decreasing with N. We resolve this paradox through a new framework. We rely on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations of the neural net output function f_N around its expectation f̄_N. These fluctuations affect the generalization error for classification: under natural assumptions, it decays to a plateau value in a power-law fashion ∼ N^(−1/2). This description breaks down at a so-called jamming transition N = N*. At this threshold, we argue that the norm of the output function, ‖f_N‖, diverges. This result leads to a plausible explanation for the cusp in test error known to occur at N*. Our results are confirmed by extensive empirical observations on the MNIST and CIFAR image datasets. Our analysis finally suggests that, given a computational envelope, the smallest generalization error is obtained by using several networks of intermediate sizes, just beyond N*, and averaging their outputs.
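The closing suggestion, training several networks of intermediate size just beyond N* and averaging their outputs, amounts to plain ensemble averaging of the output functions at test time. The sketch below illustrates the idea in PyTorch under stated assumptions: the one-hidden-layer architecture, the width, and the data are placeholders, not the paper's actual setup, and the training loop is omitted.

```python
# Minimal sketch (not the authors' code): average the outputs of several
# independently initialized networks of intermediate size.
import torch
import torch.nn as nn


def make_net(width: int, n_in: int = 784, n_out: int = 10) -> nn.Module:
    """One-hidden-layer net; `width` controls the parameter count N (assumption)."""
    return nn.Sequential(nn.Linear(n_in, width), nn.ReLU(), nn.Linear(width, n_out))


@torch.no_grad()
def ensemble_predict(nets, x: torch.Tensor) -> torch.Tensor:
    """Average the output functions f_N of the ensemble members."""
    outputs = torch.stack([net(x) for net in nets])  # shape: (n_nets, batch, n_out)
    return outputs.mean(dim=0)                       # averaged output approximates f̄_N


# Usage sketch: several independently initialized nets of intermediate size
# (each would be trained separately), averaged at test time.
nets = [make_net(width=128) for _ in range(5)]
x_test = torch.randn(32, 784)                        # placeholder test batch
preds = ensemble_predict(nets, x_test).argmax(dim=1) # class predictions
```

Averaging the ensemble output suppresses the initialization-induced fluctuations of f_N around f̄_N that the abstract identifies as the source of the ∼ N^(−1/2) decay, which is why several intermediate-size networks can beat one very large one at fixed compute.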