Abstract
Deep neural networks of the sizes commonly encountered in practice have been proven to converge towards a global minimum. The flatness of the loss surface in a neighborhood of such minima is often linked to better generalization performance. In this paper, we present a new model of growing neural networks in which neurons are added incrementally throughout the learning phase. We study the characteristics of the minima found by such a network compared to those obtained with standard feedforward neural networks. The results of this analysis show that a neural network grown with our procedure converges towards a flatter minimum than a standard neural network with the same number of parameters trained from scratch. Furthermore, our results confirm the link between flatter minima and better generalization, as the grown models tend to outperform the standard ones. We validate this approach both with small neural networks and with large deep learning models that are state of the art on Natural Language Processing tasks.
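As a rough sketch of the kind of growing procedure described above (an illustration under our own assumptions, not the paper's exact algorithm), the PyTorch snippet below periodically widens the hidden layer of a small feedforward network by one neuron during training. The growth schedule, the zero initialization of the new neuron's outgoing weights, and all names are ours.

```python
# Hypothetical sketch of growing a hidden layer during training (PyTorch).
# The growth schedule and the initialization of new neurons are assumptions,
# not the procedure used in the paper.
import torch
import torch.nn as nn

def widen_linear_pair(fc_in: nn.Linear, fc_out: nn.Linear):
    """Return copies of (fc_in, fc_out) with one extra hidden unit."""
    new_in = nn.Linear(fc_in.in_features, fc_in.out_features + 1)
    new_out = nn.Linear(fc_out.in_features + 1, fc_out.out_features)
    with torch.no_grad():
        # Copy the existing weights; keep the default random init for the new unit.
        new_in.weight[:-1].copy_(fc_in.weight)
        new_in.bias[:-1].copy_(fc_in.bias)
        new_out.weight[:, :-1].copy_(fc_out.weight)
        new_out.bias.copy_(fc_out.bias)
        # Zero the new unit's outgoing weights so the network function is unchanged.
        new_out.weight[:, -1].zero_()
    return new_in, new_out

# Toy training loop: add one hidden neuron every `grow_every` steps.
torch.manual_seed(0)
fc1, fc2 = nn.Linear(10, 1), nn.Linear(1, 2)
opt = torch.optim.SGD(list(fc1.parameters()) + list(fc2.parameters()), lr=0.1)
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
grow_every, loss_fn = 50, nn.CrossEntropyLoss()

for step in range(300):
    if step > 0 and step % grow_every == 0:
        fc1, fc2 = widen_linear_pair(fc1, fc2)
        opt = torch.optim.SGD(list(fc1.parameters()) + list(fc2.parameters()), lr=0.1)
    opt.zero_grad()
    loss = loss_fn(fc2(torch.relu(fc1(x))), y)
    loss.backward()
    opt.step()
```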
Highlights
The goal of our work is to investigate whether growing a feedforward neural network (FNN) from scratch throughout the learning phase yields better loss surface properties at minima than training a standard FNN with the same number of parameters, using the flatness measure developed in [32].
Our objective is to study the properties of the final loss surface when adding neurons one by one, all other hyper-parameters being equal.
The results show that the loss surface of the grown network has flatter minima than the one obtained with the traditional training procedure on the full network.
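To make the notion of flatness concrete, the sketch below computes a generic Hessian-based sharpness proxy: the largest eigenvalue of the loss Hessian, estimated by power iteration on Hessian-vector products. This is not the rescaling-invariant measure developed in [32]; it only illustrates how such a flatness statistic can be evaluated at a trained model's parameters.

```python
# Generic sharpness proxy: top eigenvalue of the loss Hessian via power
# iteration on Hessian-vector products. NOT the measure of [32]; for
# illustration only.
import torch

def top_hessian_eigenvalue(loss, params, iters=20):
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(iters):
        # Normalize the probe vector.
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / norm for vi in v]
        # Hessian-vector product via a second backward pass.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        # Rayleigh quotient v^T H v as the current eigenvalue estimate.
        eig = sum((hvi * vi).sum() for hvi, vi in zip(hv, v)).item()
        v = [hvi.detach() for hvi in hv]
    return eig

# Example usage on a small ReLU network with random data.
model = torch.nn.Sequential(torch.nn.Linear(10, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
loss = torch.nn.functional.cross_entropy(model(x), y)
print(top_hessian_eigenvalue(loss, list(model.parameters())))
```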
Summary
Over the last few years, deep learning [20] has had empirical successes in multiple research domains, such as computer vision, speech recognition, and machine translation [11], [35], [28]. A common assumption is that flat minima of the loss function generalize better than sharp ones. [7] questions this assumption by showing that, for deep neural networks with rectifier units, most Hessian-based measures of the flatness of a loss minimum are sensitive to rescaling, making it possible to build equivalent models corresponding to arbitrarily sharper minima. To address this issue, a recent work [32] introduced a measure invariant to rescaling and used it to show that flatter minima achieve better generalization performance than sharper ones. Another recent work, [16], theoretically proves a related result: adding one special neuron per output unit eliminates all suboptimal local minima of any neural network.
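The rescaling issue raised by [7] comes from the positive homogeneity of rectifiers: for any a > 0, multiplying one layer's weights by a and dividing the next layer's by a leaves the network function unchanged while moving the parameters, and therefore any Hessian-based flatness statistic, to a different point in weight space. The NumPy check below (ours, for illustration only) verifies the function-preservation part of this argument.

```python
# Rescaling symmetry of rectifier networks noted by [7]: for a > 0,
# (W2 / a) @ relu(a * W1 @ x) == W2 @ relu(W1 @ x), so the function is
# unchanged while the weights (and hence Hessian-based flatness statistics)
# correspond to a different point in parameter space.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(5, 3)), rng.normal(size=(2, 5))
x = rng.normal(size=(3,))
relu = lambda z: np.maximum(z, 0.0)

a = 10.0  # arbitrary positive rescaling factor
out_original = W2 @ relu(W1 @ x)
out_rescaled = (W2 / a) @ relu((a * W1) @ x)

print(np.allclose(out_original, out_rescaled))      # True: same function
print(np.linalg.norm(W1), np.linalg.norm(a * W1))   # very different weights
```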