AdaTerm: Adaptive T-distribution estimated robust moments for Noise-Robust stochastic gradient optimization

Takamitsu Matsubara, Wendyam Eric Lionel Ilboudo, Taisuke Kobayashi

doi:10.1016/j.neucom.2023.126692


Open Access



Abstract

With the increasing practicality of deep learning applications, practitioners are inevitably faced with datasets corrupted by noise from various sources such as measurement errors, mislabeling, and estimated surrogate inputs/outputs that can adversely impact the optimization results. It is a common practice to improve the optimization algorithm’s robustness to noise, since this algorithm is ultimately in charge of updating the network parameters. Previous studies revealed that the first-order moment used in Adam-like stochastic gradient descent optimizers can be modified based on the Student’s t-distribution. While this modification led to noise-resistant updates, the other associated statistics remained unchanged, resulting in inconsistencies in the assumed models. In this paper, we propose AdaTerm, a novel approach that incorporates the Student’s t-distribution to derive not only the first-order moment but also all the associated statistics. This provides a unified treatment of the optimization process, offering a comprehensive framework under the statistical model of the t-distribution for the first time. The proposed approach offers several advantages over previously proposed approaches, including reduced hyperparameters and improved robustness and adaptability. AdaTerm achieves this by considering the interdependence of gradient dimensions. In particular, upon detection, AdaTerm excludes aberrant gradients from the update process and enhances its robustness for subsequent updates. Conversely, it performs normal parameter updates when the gradients are statistically valid, allowing for flexibility in adapting its robustness. This noise-adaptive behavior contributes to AdaTerm’s exceptional learning performance, as demonstrated through various optimization problems with different and/or unknown noise ratios. Furthermore, we introduce a new technique for deriving a theoretical regret bound without relying on AMSGrad, providing a valuable contribution to the field.
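As a rough illustration of the noise-adaptive behavior described in the abstract, the following minimal sketch shows how a Student's t-based weight can shrink the step of an Adam-like first-moment estimate when a gradient looks aberrant. It is not the AdaTerm update rule itself. The function names, the diagonal scale, the default degrees of freedom `nu`, and the choice of scaling the step by the ratio w / w_max are all assumptions made for illustration.

```python
def t_weight_ratio(grad, mean, var, nu=5.0, eps=1e-8):
    """Return w / w_max in (0, 1] under a diagonal Student's t model.

    Assumption: w = (nu + d) / (nu + dist) is the usual t-distribution
    weight and w_max = (nu + d) / nu is its value at dist = 0, so the
    ratio simplifies to nu / (nu + dist).
    """
    # Squared Mahalanobis-style distance, summed over all dimensions,
    # so one shared weight reflects the whole gradient vector.
    dist = sum((g - m) ** 2 / (v + eps) for g, m, v in zip(grad, mean, var))
    return nu / (nu + dist)


def robust_mean_update(grad, mean, var, beta=0.9, nu=5.0):
    """One hypothetical robust update of the first-moment estimate."""
    ratio = t_weight_ratio(grad, mean, var, nu)
    # ratio close to 1: ordinary exponential moving average step.
    # ratio close to 0: the gradient is effectively excluded.
    step = (1.0 - beta) * ratio
    new_mean = [m + step * (g - m) for g, m in zip(grad, mean)]
    return new_mean, ratio


mean, var = [0.0, 0.0], [1.0, 1.0]
typical_mean, typical_ratio = robust_mean_update([0.1, -0.1], mean, var)
outlier_mean, outlier_ratio = robust_mean_update([100.0, 100.0], mean, var)
```

With these inputs, the typical gradient is applied almost as a plain moving-average step, while the outlier barely moves the estimate. A plain moving average with the same `beta` would have shifted each component by 10.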

Full Text