Abstract

We prove that gradient descent training of a two-layer neural network on empirical or population risk may not decrease the population risk at an order faster than $t^{-4/(d-2)}$ under mean field scaling. The loss functional is the mean squared error with a Lipschitz-continuous target function and data distributed uniformly on the $d$-dimensional unit cube. Thus, gradient descent training to fit reasonably smooth but truly high-dimensional data may be subject to the curse of dimensionality. We present numerical evidence that gradient descent training with general Lipschitz target functions becomes progressively slower as the dimension increases, but converges at approximately the same rate in all dimensions when the target function lies in the natural function space for two-layer ReLU networks.

Impact Statement

Artificial neural networks perform well in many real-life applications, but may suffer from the curse of dimensionality on certain problems. We provide theoretical and numerical evidence that this may be related to whether the target function lies in the hypothesis class described by infinitely wide networks. The training dynamics are considered in the fully non-linear regime and are not reduced to neural tangent kernels. We believe that it will be essential to study these hypothesis classes in detail in order to choose an appropriate machine learning model for a given problem. The goal of the article is to illustrate this in a mathematically sound and numerically convincing fashion.
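For concreteness, the following is a minimal sketch of the kind of experiment the numerical evidence refers to: a two-layer ReLU network under mean field scaling, $f(x) = \frac{1}{m}\sum_{i=1}^m a_i\,\mathrm{ReLU}(w_i \cdot x + b_i)$, trained by plain gradient descent on the empirical mean squared error for a Lipschitz target with data sampled uniformly from the unit cube $[0,1]^d$. The target $f^*(x) = \|x - \tfrac12\|$, the width, the step size, and all other settings below are illustrative assumptions rather than the authors' setup; held-out error is used as a proxy for the population risk.

```python
# Minimal sketch (assumed setup, not the authors' code): two-layer ReLU network
# in mean field scaling trained by gradient descent on mean squared error,
# data uniform on the d-dimensional unit cube, Lipschitz target f*(x) = ||x - 1/2||.
import numpy as np

rng = np.random.default_rng(0)

d, m = 8, 512                 # input dimension and network width (assumed values)
lr = 0.25 * m                 # step size scaled with the width, as is natural under mean-field scaling (assumed)
steps = 2000                  # number of gradient descent steps (assumed)
n_train, n_test = 2048, 4096

def target(x):
    # Lipschitz target on the unit cube (illustrative choice).
    return np.linalg.norm(x - 0.5, axis=1)

# Training data and a held-out set approximating the population risk.
X = rng.uniform(0.0, 1.0, size=(n_train, d))
y = target(X)
X_test = rng.uniform(0.0, 1.0, size=(n_test, d))
y_test = target(X_test)

# Parameters of f(x) = (1/m) * sum_i a_i * ReLU(w_i . x + b_i).
W = rng.standard_normal((m, d))
b = rng.standard_normal(m)
a = rng.standard_normal(m)

def forward(X, W, b, a):
    pre = X @ W.T + b                  # pre-activations, shape (n, m)
    act = np.maximum(pre, 0.0)         # ReLU
    return act @ a / m, pre, act       # mean-field 1/m output scaling

for t in range(steps):
    pred, pre, act = forward(X, W, b, a)
    res = pred - y                     # residuals, shape (n,)
    # Gradients of the empirical risk R = (1/(2n)) * sum_j (f(x_j) - y_j)^2.
    grad_a = act.T @ res / (m * n_train)
    mask = (pre > 0).astype(float)
    delta = (res[:, None] * a[None, :] / (m * n_train)) * mask   # (n, m)
    grad_W = delta.T @ X
    grad_b = delta.sum(axis=0)
    a -= lr * grad_a
    W -= lr * grad_W
    b -= lr * grad_b
    if t % 500 == 0 or t == steps - 1:
        test_pred, _, _ = forward(X_test, W, b, a)
        print(f"step {t:5d}  train MSE {np.mean(res**2):.4f}  "
              f"test MSE {np.mean((test_pred - y_test)**2):.4f}")
```

Repeating the run for increasing $d$, once with a general Lipschitz target as above and once with a target from the natural function space for two-layer ReLU networks (for instance a single ReLU ridge function), gives the type of dimension comparison described in the abstract.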
