Abstract

Adaptive gradient methods such as adaptive moment estimation (Adam), RMSProp, and adaptive gradient (AdaGrad) use the temporal history of gradient updates to speed up convergence and reduce reliance on manual learning-rate tuning, making them a popular choice for off-the-shelf Deep Neural Network (DNN) optimizers. In this article, we study the robustness of neural network optimizers in the presence of training perturbations. We show that popular adaptive optimization methods generalize poorly when learning from noisy training data, compared to vanilla Stochastic Gradient Descent (SGD) and its variants, which exhibit better implicit regularization properties. We construct an illustrative family of two-class, linearly separable toy datasets on which models trained under noise with adaptive optimizers reach only 52% test accuracy (close to a random classifier), whereas SGD-based methods achieve 100% test accuracy. We strengthen this hypothesis with an empirical analysis using Convolutional Neural Networks (CNNs) on publicly available image datasets: we train models with various optimizers on noisy training data and compute test accuracy on clean test data. Our results further highlight the robustness of SGD optimization to such noisy training data compared to its adaptive counterparts. Based on these results, we suggest reconsidering the extensive use of adaptive gradient methods for neural network optimization, especially when the training data is noisy.
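
The following is a minimal sketch of this kind of toy experiment, not the paper's exact construction: the data-generating process (labels given by the sign of one coordinate), the label-flip noise model, and all hyperparameters are illustrative assumptions, and this sketch need not reproduce the reported 52% vs. 100% gap. It trains the same linear classifier on noisy, linearly separable two-class data with Adam and with SGD, then evaluates both on a clean test set.

    # Hedged sketch: compare Adam vs. SGD trained on noisy, linearly
    # separable two-class data, evaluating on clean test data. The dataset
    # family, noise model, and hyperparameters are illustrative assumptions.
    import torch

    torch.manual_seed(0)

    def make_data(n, d=100, flip_frac=0.0):
        # Labels in {-1, +1} given by the sign of the first coordinate,
        # so the clean data are linearly separable.
        X = torch.randn(n, d)
        y = torch.sign(X[:, 0])
        if flip_frac > 0:
            idx = torch.randperm(n)[: int(flip_frac * n)]
            y[idx] = -y[idx]  # label-flip noise on the training set only
        return X, y

    def train_and_eval(opt_name, X_tr, y_tr, X_te, y_te, epochs=500):
        w = torch.zeros(X_tr.shape[1], requires_grad=True)
        opt = (torch.optim.Adam([w], lr=1e-3) if opt_name == "adam"
               else torch.optim.SGD([w], lr=1e-2))
        for _ in range(epochs):
            opt.zero_grad()
            # Logistic loss for labels in {-1, +1}
            loss = torch.nn.functional.softplus(-y_tr * (X_tr @ w)).mean()
            loss.backward()
            opt.step()
        with torch.no_grad():
            # Clean-test accuracy of the sign classifier
            return ((X_te @ w).sign() == y_te).float().mean().item()

    X_tr, y_tr = make_data(500, flip_frac=0.3)   # noisy training set
    X_te, y_te = make_data(2000, flip_frac=0.0)  # clean test set
    for name in ("sgd", "adam"):
        print(name, train_and_eval(name, X_tr, y_tr, X_te, y_te))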

Highlights

  • Deep Neural Networks (DNNs) [1] are high-capacity models in which the number of learnable parameters is very large relative to the finite amount of training data

  • We analytically show that adaptive gradient methods completely fail to learn any patterns from the data and do not generalize to the clean test set

  • We review previous work on generalization in neural networks and its dependence on the optimization strategy


Summary

INTRODUCTION

DNNs [1] are high-capacity models in which the number of learnable parameters is very large relative to the finite amount of training data. We benchmark and compare the performance of adaptive and non-adaptive gradient methods in the presence of training perturbations, a practical problem that arises with noisy acquisition devices. Benchmarking the robustness of optimizers against such training noise is an important task: it measures whether the features selected by the optimizer conform to semantic information such as color and shape (for images), which enables better generalization to noiseless test samples. Based on our benchmarking results, in the presence of training noise we suggest using SGD-based optimizers with learning-rate tuning instead of adaptive gradient methods for better generalization performance. A sketch of this benchmarking protocol follows.
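
The sketch below is one way to realize the protocol described above, not the paper's exact setup: the dataset (MNIST), the training perturbation (additive Gaussian pixel noise), the small CNN, and all hyperparameters are illustrative assumptions. It trains the same model with several optimizers on corrupted training images and reports accuracy on the clean test set.

    # Hedged sketch of the benchmarking protocol: train on noise-corrupted
    # images, evaluate on the clean test set. Dataset, noise model,
    # architecture, and hyperparameters are illustrative assumptions.
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    def noisy(x, sigma=0.5):
        return x + sigma * torch.randn_like(x)  # additive Gaussian pixel noise

    train_tf = transforms.Compose([transforms.ToTensor(), noisy])
    test_tf = transforms.ToTensor()  # the test set stays clean
    train_dl = DataLoader(
        datasets.MNIST(".", train=True, download=True, transform=train_tf),
        batch_size=128, shuffle=True)
    test_dl = DataLoader(
        datasets.MNIST(".", train=False, transform=test_tf), batch_size=256)

    def make_model():
        # Small CNN classifier for 1x28x28 inputs
        return nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                             nn.MaxPool2d(2), nn.Flatten(),
                             nn.Linear(16 * 14 * 14, 10))

    optimizers = {
        "sgd":  lambda p: torch.optim.SGD(p, lr=0.01, momentum=0.9),
        "adam": lambda p: torch.optim.Adam(p, lr=1e-3),
    }

    for name, make_opt in optimizers.items():
        model, loss_fn = make_model(), nn.CrossEntropyLoss()
        opt = make_opt(model.parameters())
        for epoch in range(2):  # short run for illustration
            for x, y in train_dl:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
        with torch.no_grad():
            correct = sum((model(x).argmax(1) == y).sum().item()
                          for x, y in test_dl)
        acc = correct / len(test_dl.dataset)
        print(f"{name}: clean test accuracy = {acc:.3f}")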

RELATED WORK
COMMON FRAMEWORK FOR OPTIMIZATION
BENCHMARKING OPTIMIZERS ON HIGH-DIMENSIONAL TRAINING DATA
EXPERIMENTAL RESULTS
CONCLUSIONS
