Abstract

In this paper, we provide new results and algorithms (including backtracking versions of Nesterov accelerated gradient and Momentum) which are more applicable to large-scale optimisation, as in Deep Neural Networks. We also demonstrate that Backtracking Gradient Descent (Backtracking GD) can obtain good upper bound estimates for local Lipschitz constants of the gradient, and that the convergence rate of Backtracking GD is similar to that in the classical work of Armijo. Experiments on the CIFAR10 and CIFAR100 datasets with various popular architectures verify a heuristic argument that, in the mini-batch setting, Backtracking GD stabilises to a finite union of sequences constructed from Standard GD, and show that our new algorithms (while automatically fine-tuning learning rates) perform better than current state-of-the-art methods such as Adam, Adagrad, Adadelta, RMSProp, Momentum and Nesterov accelerated gradient. To help readers avoid confusion between heuristics and more rigorously justified algorithms, we also provide a review of the current state of convergence results for gradient descent methods. Accompanying source code is available on GitHub.
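To make the terminology concrete, the following is a minimal sketch of the classical Armijo backtracking rule on which Backtracking GD is built. It is not the authors' implementation; the function name and the default values of delta0, alpha and beta are illustrative assumptions.

```python
import numpy as np

def backtracking_gd_step(f, grad_f, x, delta0=1.0, alpha=0.5, beta=0.5):
    """One gradient descent step with Armijo backtracking line search (sketch).

    Starting from a trial learning rate delta0, the rate is shrunk by the
    factor beta until the Armijo condition
        f(x - delta * g) <= f(x) - alpha * delta * ||g||^2
    holds, where g = grad_f(x). As discussed in the paper, the accepted
    delta also carries information about the local Lipschitz constant of
    the gradient (roughly, the constant is bounded in terms of 1/delta).
    """
    g = grad_f(x)
    fx = f(x)
    gg = np.dot(g, g)
    delta = delta0
    while f(x - delta * g) > fx - alpha * delta * gg:
        delta *= beta
    return x - delta * g, delta

# Illustrative usage on a simple quadratic cost f(x) = 0.5 * ||x||^2.
f = lambda x: 0.5 * np.dot(x, x)
grad_f = lambda x: x
x = np.array([3.0, -4.0])
for _ in range(10):
    x, delta = backtracking_gd_step(f, grad_f, x)
```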

Highlights

  • We provide a non-technical overview of the important role and current practices of Gradient Descent methods (GD) in optimisation, in particular in large-scale optimisation as in Deep Neural Networks (DNN), and of some new features of our main results in this paper. One special feature of modern society is the need to solve large-scale optimisation problems quickly, stably, efficiently and reproducibly.

  • Note that, as mentioned in the Introduction, new versions of Backtracking GD (based on results and ideas in Truong and Nguyen [47]) are proposed in Truong [44] and shown to avoid saddle points under assumptions more general than those required by Lee et al. [27] and Panageas and Piliouras [34] for Standard GD.

  • We run experiments with Two-way Backtracking GD for two cost functions: one is the Mexican hat in Example 3.3, and the other is the cost function obtained by applying Resnet18 to a random subset of 500 samples from CIFAR10 (a sketch of the two-way backtracking rule follows these highlights).

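As we read the paper, Two-way Backtracking GD differs from plain Backtracking GD in that the line search at each iteration starts from the learning rate accepted at the previous iteration, and may either increase or decrease it. The sketch below illustrates that idea only; the function name, the cap delta_max and the values of alpha and beta are assumptions, not the authors' implementation.

```python
import numpy as np

def two_way_backtracking_step(f, grad_f, x, delta_prev,
                              alpha=0.5, beta=0.5, delta_max=10.0):
    """One step of a sketched Two-way Backtracking GD.

    The trial learning rate starts from the rate accepted at the previous
    iteration. If the Armijo condition already holds, the rate is tentatively
    increased (divided by beta, capped at delta_max); otherwise it is shrunk
    by beta until the condition holds.
    """
    g = grad_f(x)
    fx = f(x)
    gg = np.dot(g, g)
    armijo = lambda d: f(x - d * g) <= fx - alpha * d * gg
    delta = delta_prev
    if armijo(delta):
        # Try larger rates while they still satisfy the Armijo condition.
        while delta / beta <= delta_max and armijo(delta / beta):
            delta /= beta
    else:
        # Shrink until the Armijo condition holds.
        while not armijo(delta):
            delta *= beta
    return x - delta * g, delta
```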

Summary

Introduction

We provide a non-technical overview of the important role and current practices of Gradient Descent methods (GD) in optimisation, in particular in large-scale optimisation as in Deep Neural Networks (DNN), and of some new features of our main results in this paper. One special feature of modern society is the need to solve large-scale optimisation problems quickly, stably, efficiently and reproducibly. One exemplar of this is the development of Deep Learning, which has obtained spectacular achievements recently. Modern state-of-the-art DNN can have millions of parameters. At this scale, the only tools one can rely on are numerical optimisation algorithms, which serve to bring us close to good local minima. As impressive as these achievements are, there are many serious concerns about current practices in Deep Learning, for example that trained models are easily fooled and are still not safe. The class of functions we work with is the most general one in which current techniques for non-convex optimisation can be applied, and it is flexible enough to adapt to many kinds of realistic applications.

A Brief Introduction to Gradient Descent Methods
What is the State-of-the-Art for Convergence of GD Methods?
What is New About this Paper?
Overview and Comparison of Previous Results
A General Convergence Result for Backtracking GD
Comparison to Previous Work
Backtracking Versions of MMT and NAG
A Heuristic Argument for the Effectiveness of Standard GD
Two-Way Backtracking GD
Unbounded Backtracking GD
Rescaling of Learning Rates
Mini-batch Backtracking Algorithms
Experimental Results
Experiment 1
Experiment 2
Experiment 3
Experiment 4
Experiment 5
Conclusions