Abstract

We explore a new approach for training neural networks where all loss functions are replaced by hard constraints. The same approach is very successful in phase retrieval, where signals are reconstructed from magnitude constraints and general characteristics (sparsity, support, etc.). Instead of taking gradient steps, the optimizer in the constraint-based approach, called relaxed–reflect–reflect (RRR), derives its steps from projections to local constraints. In neural networks one such projection makes the minimal modification to the inputs x, the associated weights w, and the pre-activation value y at each neuron, to satisfy the equation x · w = y. These projections, along with a host of other local projections (constraining pre- and post-activations, etc.) can be partitioned into two sets such that all the projections in each set can be applied concurrently—across the network and across all data in the training batch. This partitioning into two sets is analogous to the situation in phase retrieval and the setting for which the general purpose RRR optimizer was designed. Owing to the novelty of the method, this paper also serves as a self-contained tutorial. Starting with a single-layer network that performs nonnegative matrix factorization, and concluding with a generative model comprising an autoencoder and classifier, all applications and their implementations by projections are described in complete detail. Although the new approach has the potential to extend the scope of neural networks (e.g., by defining activations not through functions but through constraint sets), most of the featured models are standard to allow comparison with stochastic gradient descent.
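To make the RRR update concrete, the following is a minimal sketch of the iteration on a toy two-set feasibility problem. The constraint sets here are illustrative choices, not taken from the paper: A is the unit circle (non-convex, like the bilinear neuron constraints) and B is the vertical line x = 0.5, which intersect at (0.5, ±√0.75). The update rule v ← v + β (P_B(2 P_A(v) − v) − P_A(v)) is the generic RRR step built from the two constraint projections.

```python
import math

# Toy RRR (relaxed-reflect-reflect) demo on a 2-D feasibility problem.
# The two sets are illustrative, not from the paper:
#   A = unit circle {v : ||v|| = 1}   (non-convex)
#   B = vertical line {v : v_x = 0.5}
# They intersect at (0.5, +/- sqrt(0.75)).

def proj_A(v):
    """Project onto the unit circle (nearest point: normalize)."""
    n = math.hypot(v[0], v[1])
    return (v[0] / n, v[1] / n)

def proj_B(v):
    """Project onto the line x = 0.5 (clamp the x-coordinate)."""
    return (0.5, v[1])

def rrr(v, beta=0.5, iters=2000):
    """Generic RRR step: v <- v + beta * (P_B(2 P_A(v) - v) - P_A(v))."""
    for _ in range(iters):
        pa = proj_A(v)
        refl = (2 * pa[0] - v[0], 2 * pa[1] - v[1])  # reflect through A
        pb = proj_B(refl)
        v = (v[0] + beta * (pb[0] - pa[0]),
             v[1] + beta * (pb[1] - pa[1]))
    return v

v = rrr((2.0, 1.0))
sol = proj_A(v)  # at a fixed point, P_A(v) lies in both A and B
```

Note the key property the paper exploits: each step needs only the two projections, never a gradient, and at a fixed point the projected iterate P_A(v) satisfies both constraints simultaneously.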

Highlights

  • When general purpose computers arrived in the 1960s it was realized that certain tasks, such as sorting and Fourier transforms, would be so ubiquitous that it made sense to implement them with provably optimal algorithms

  • The utility of neural networks for representing and distilling complex data cannot be overstated. Does this utility derive from the forgiving nature of the platform, on which even unsophisticated and often undisciplined training usually succeeds? Or have neural networks risen to the top because they are exceptionally well suited for gradient descent, the training algorithm one would like to use because of its intuitive appeal? One way to address these questions, and the one taken in this paper, is to try a radically different approach to training

  • Our approach avoids gradients and loss functions and was inspired by phase retrieval, where the most successful algorithms take steps derived from constraint projections


Introduction

When general purpose computers arrived in the 1960s, it was realized that certain tasks, such as sorting and Fourier transforms, would be so ubiquitous that it made sense to implement them with provably optimal algorithms. While there is a choice of loss function to apply to the training task, the inherent complexity of the models makes proving optimality, for any loss, well beyond reach. Faced with the theoretical intractability of neural network training, it is not surprising that research has converged on a single empirical strategy: gradient descent. Central to this method of training is a loss function that encapsulates everything relevant to the application, from the definition of class boundaries, to the structure of internal representations, to details such as model sparsity and parameter quantization. The result is that the theory of neural network training has become a single evolving paradigm.

