Best k-Layer Neural Network Approximations

Lek-Heng Lim, Yang Qi, Mateusz Michałek

doi:10.1007/s00365-021-09545-2

Abstract

We show that the empirical risk minimization (ERM) problem for neural networks has no solution in general. Given a training set $s_1, \ldots, s_n \in \mathbb{R}^p$ with corresponding responses $t_1, \ldots, t_n \in \mathbb{R}^q$, fitting a $k$-layer neural network $\nu_\theta : \mathbb{R}^p \rightarrow \mathbb{R}^q$ involves estimation of the weights $\theta \in \mathbb{R}^m$ via an ERM: $$\begin{aligned} \inf_{\theta \in \mathbb{R}^m} \ \sum_{i=1}^n \Vert t_i - \nu_\theta(s_i) \Vert_2^2. \end{aligned}$$ We show that even for $k = 2$, this infimum is not attainable in general for common activations like ReLU, hyperbolic tangent, and sigmoid functions. In addition, we deduce that if one attempts to minimize such a loss function when its infimum is not attainable, the values of $\theta$ necessarily diverge to $\pm\infty$. We will show that for the smooth activations $\sigma(x) = 1/\bigl(1 + \exp(-x)\bigr)$ and $\sigma(x) = \tanh(x)$, such failure to attain an infimum can happen on a subset of responses of positive measure. For the ReLU activation $\sigma(x) = \max(0, x)$, we completely classify the cases where the ERM for a best two-layer neural network approximation attains its infimum. In recent applications of neural networks, where overfitting is commonplace, the failure to attain an infimum is avoided by ensuring that the system of equations $t_i = \nu_\theta(s_i)$, $i = 1, \ldots, n$, has a solution. For a two-layer ReLU-activated network, we will show when such a system of equations has a solution generically, i.e., when such a neural network can be fitted perfectly with probability one.
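
As a rough illustration of an unattained infimum, the sketch below fits a single sigmoid unit $\nu_\theta(s) = \sigma(ws + b)$ to two scalar points by plain gradient descent. This is not the construction used in the paper: it is the simplest case we could think of, where the responses $0$ and $1$ lie on the boundary of the sigmoid's range $(0, 1)$, so the squared loss has infimum $0$ but never reaches it. The data, the step size, and the names `loss` and `gradient_descent` are assumptions made only for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def loss(w, b, s, t):
    # Empirical risk: sum of squared residuals over the training set.
    return float(np.sum((t - sigmoid(w * s + b)) ** 2))


def gradient_descent(s, t, steps, lr=0.5):
    # Hypothetical helper: plain gradient descent on theta = (w, b), from 0.
    w, b = 0.0, 0.0
    for _ in range(steps):
        y = sigmoid(w * s + b)
        # Derivative of (t - y)^2 with respect to the pre-activation w*s + b.
        g = -2.0 * (t - y) * y * (1.0 - y)
        w -= lr * float(np.sum(g * s))
        b -= lr * float(np.sum(g))
    return w, b


# Assumed toy data: the targets 0 and 1 are never attained by the sigmoid.
s = np.array([-1.0, 1.0])
t = np.array([0.0, 1.0])

for steps in (10, 100, 1000, 5000):
    w, b = gradient_descent(s, t, steps)
    print(f"steps={steps:5d}  |w|={abs(w):.3f}  loss={loss(w, b, s, t):.3e}")
```

In this toy setting the printed loss keeps shrinking while $|w|$ keeps growing, which is consistent with the divergence of $\theta$ described above. The positive-measure results for sigmoid and $\tanh$ concern less obvious configurations that this example does not capture.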

Full Text