Abstract

Gradient descent (GD) type optimization schemes are the standard methods for training artificial neural networks (ANNs) with rectified linear unit (ReLU) activation. Such schemes can be regarded as discretizations of the corresponding gradient flows (GFs). In this work we analyze GF processes in the training of ANNs with ReLU activation and three layers (one input, one hidden, and one output layer). In particular, we prove that if the distribution of the input data is absolutely continuous with respect to the Lebesgue measure, then the risk of every bounded GF trajectory converges to the risk of a critical point. In addition, we show that for a 1-dimensional affine target function and a uniform input distribution, the risk of every bounded GF trajectory converges to zero if the initial risk is sufficiently small. Finally, we show that the boundedness assumption in this second result can be removed if the hidden layer consists of only one neuron.
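To illustrate how GD relates to GF, the following minimal sketch treats plain GD as an explicit Euler discretization of the gradient flow Θ'(t) = -∇L(Θ(t)) for a network with one hidden ReLU layer, a 1-dimensional affine target, and uniformly distributed inputs. It is not taken from the paper: the function names, the hidden width of 3, the initial parameters, the target 2x + 1, the step size, and the midpoint grid used to approximate the uniform distribution on [0, 1] are all assumptions chosen for illustration. The sketch also uses the convention ReLU'(0) = 0.

```python
import numpy as np


def risk_and_grad(theta, x, y):
    """Mean squared risk of a one-hidden-layer ReLU network and its gradient.

    theta = (w, b, v, c): inner weights, inner biases, outer weights, outer bias.
    The derivative of ReLU at 0 is taken to be 0 (an assumed convention).
    """
    w, b, v, c = theta
    z = np.outer(x, w) + b            # pre-activations, shape (n_samples, width)
    h = np.maximum(z, 0.0)            # ReLU activations
    err = h @ v + c - y               # residuals, shape (n_samples,)
    active = (z > 0).astype(float)    # indicator of active neurons
    n = x.size
    back = 2.0 * (err[:, None] * active * v) / n
    g_w = (back * x[:, None]).sum(axis=0)
    g_b = back.sum(axis=0)
    g_v = 2.0 * (h.T @ err) / n
    g_c = 2.0 * err.mean()
    return np.mean(err ** 2), (g_w, g_b, g_v, g_c)


def gradient_descent(theta, x, y, step, n_steps):
    """Explicit Euler steps theta <- theta - step * grad, recording the risk."""
    risks = []
    for _ in range(n_steps):
        risk, grads = risk_and_grad(theta, x, y)
        risks.append(risk)
        theta = tuple(p - step * g for p, g in zip(theta, grads))
    risks.append(risk_and_grad(theta, x, y)[0])
    return theta, risks


# Midpoint grid on [0, 1] as a stand-in for the uniform input distribution.
x = (np.arange(200) + 0.5) / 200
y = 2.0 * x + 1.0                     # hypothetical affine target function

# Hypothetical initial parameters for a hidden layer of width 3.
theta0 = (
    np.array([1.0, -1.0, 0.5]),       # inner weights w
    np.array([0.0, 0.5, -0.2]),       # inner biases b
    np.array([0.5, 0.5, -0.5]),       # outer weights v
    0.0,                              # outer bias c
)

theta, risks = gradient_descent(theta0, x, y, step=0.01, n_steps=5000)
print(f"initial risk: {risks[0]:.4f}, final risk: {risks[-1]:.6f}")
```

Shrinking the step size while increasing the number of steps proportionally makes the discrete iterates track the continuous-time trajectory more closely. The convergence statements above concern the GF itself, so this discretization can only suggest the behavior, not confirm it.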
