Abstract

Owing to its powerful data-representation ability, deep learning has dramatically improved the state of the art in many practical applications. However, its utility depends heavily on the tuning of hyper-parameters, including the learning rate, batch size, and network initialization. Although many first-order adaptive methods (e.g., Adam, Adagrad) have been proposed to adjust the learning rate based on gradients, they are susceptible to the initial learning rate and the network architecture. Therefore, the main challenge of using deep learning in practice is how to reduce the cost of tuning hyper-parameters. To address this, we propose a heuristic zeroth-order learning rate method, Adacomp, which adaptively adjusts the learning rate based only on values of the loss function. The main idea is that Adacomp penalizes large learning rates to ensure convergence and compensates small learning rates to accelerate the training process. Therefore, Adacomp is robust to the initial learning rate. Extensive experiments were conducted, including comparisons with six typical adaptive methods (Momentum, Adagrad, RMSprop, Adadelta, Adam, and Adamax) on several benchmark image classification datasets (MNIST, KMNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100). Experimental results show that Adacomp is robust not only to the initial learning rate but also to the network architecture, network initialization, and batch size.
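The abstract does not reproduce Adacomp's actual update rule (Equation (9) in the full text). As a rough illustration of the general idea of a zeroth-order, loss-driven learning-rate adjustment, the Python sketch below rescales the learning rate of a PyTorch-style optimizer using only successive loss values; the function name adjust_lr and the decay and growth factors are illustrative assumptions, not the paper's method.

    # Minimal sketch of a zeroth-order, loss-based learning-rate adjustment.
    # NOTE: this is NOT the paper's Adacomp rule (Equation (9) is not shown on
    # this page); the thresholds and factors below are illustrative assumptions.
    def adjust_lr(optimizer, prev_loss, curr_loss, decay=0.5, growth=1.1):
        """Shrink the learning rate when the loss rises (penalize large steps)
        and gently enlarge it when the loss falls (compensate small steps)."""
        for group in optimizer.param_groups:
            if curr_loss > prev_loss:   # step was too aggressive
                group["lr"] *= decay
            else:                       # step was safe; speed up training
                group["lr"] *= growth

    # Hypothetical usage inside a training loop (model, loader, criterion,
    # and optimizer assumed to be defined elsewhere):
    # prev_loss = float("inf")
    # for x, y in loader:
    #     optimizer.zero_grad()
    #     loss = criterion(model(x), y)
    #     loss.backward()
    #     optimizer.step()
    #     adjust_lr(optimizer, prev_loss, loss.item())
    #     prev_loss = loss.item()

Because the adjustment uses only loss values, it requires no extra gradient information beyond what the underlying optimizer already computes.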

Highlights

  • Deep learning has been highly successful across a variety of applications, including speech recognition, visual object recognition, and object detection [1,2,3]

  • We applied Adacomp (with β = 0.6 in Equation (9)) to the MNIST dataset to validate its robustness with respect to the learning rate, batch size, and initial model parameters

  • We summarize the robustness of eight methods with respect to different network architectures


Introduction

Deep learning has been highly successful across a variety of applications, including speech recognition, visual object recognition, and object detection [1,2,3]. Deep learning consists of training and inference phases. A predefined network is trained on a given dataset (known as the training set) to learn the underlying distribution characteristics. The well-trained network is then applied to unseen data (known as the test set) to carry out specific tasks, such as regression and classification. One fundamental purpose of deep learning is to achieve as high an accuracy as possible in the inference phase, having learned only from the training set. In essence, training a deep learning network is equivalent to minimizing an unconstrained, non-convex but smooth function of the network parameters.
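The excerpt ends before the objective itself is stated; a standard form of such a training objective, given here as an assumed reconstruction rather than the paper's exact notation, is the empirical risk over n training examples:

    \min_{\theta \in \mathbb{R}^d} \; f(\theta) \;=\; \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(h_\theta(x_i),\, y_i\bigr),

where h_\theta denotes the network with parameters \theta and \ell is a per-example loss such as the cross-entropy.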
