Abstract

Momentum acceleration is widely used to build gradient-based algorithms with fast convergence in large-scale optimization. Recently, Nesterov's momentum and Katyusha momentum have significantly improved the convergence of stochastic optimization methods. However, the practical gain of Nesterov's momentum is mainly a by-product of mini-batching, while relying on Katyusha momentum alone in the stochastic steps can make the optimization unstable. In this paper, we build a stochastic and doubly accelerated momentum method (SDAMM), which incorporates Nesterov's momentum and Katyusha momentum within a variance-reduction framework, to stabilize the accelerated algorithm and reduce the dependence on mini-batching. Theoretically, SDAMM achieves the best-known convergence rates for convex objectives. Experimental results demonstrate that SDAMM is competitive with state-of-the-art methods on optimization problems in machine learning.

Highlights

  • In this paper, we consider the composite convex optimization problem associated with regularized empirical risk minimization (ERM), which is pervasive in machine learning [1]

  • The acceleration in the stochastic and doubly accelerated momentum method (SDAMM) combines Nesterov's momentum in the outer epoch with Katyusha momentum in the inner iteration, while a practical importance sampling technique is employed for further acceleration (a hedged sketch of this structure is given after the list)

  • We prove that our SDAMM algorithm achieves the best-known convergence rate of O(1/T²) and a low computational complexity of O(n√(1/ε) + √(nL/ε))
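
To make the structure of the double acceleration concrete, the following is a minimal sketch: a Katyusha-style coupling and an SVRG-style variance-reduced gradient in the inner iteration, and a Nesterov-style extrapolation of the snapshot point across outer epochs. It assumes uniform sampling instead of the paper's importance sampling, omits the proximal handling of r(x), and uses simplified parameter choices; the names sdamm_like, grad_i, and full_grad are ours, so it should be read as an illustration of the idea rather than the paper's exact SDAMM update rules.

```python
# A minimal sketch of the double-acceleration structure described above.
# Assumptions: uniform sampling (the paper uses importance sampling), no
# proximal step for r(x), simplified parameter choices. The names
# sdamm_like, grad_i, full_grad are ours, not the paper's.
import numpy as np

def sdamm_like(grad_i, full_grad, x0, n, L, epochs=30, m=None):
    """grad_i(x, i): gradient of the i-th component at x;
    full_grad(x): gradient of the full (average) objective at x."""
    m = m or n                        # inner-loop length per epoch
    tau1, tau2 = 0.4, 0.5             # Katyusha-style coupling weights
    eta = 1.0 / (3.0 * L)             # step size for the y-sequence
    x_tilde = x_tilde_prev = x0.copy()
    y = z = x0.copy()
    for s in range(epochs):
        # Outer epoch: Nesterov-style extrapolation of the snapshot
        beta = s / (s + 3.0)
        snapshot = x_tilde + beta * (x_tilde - x_tilde_prev)
        mu = full_grad(snapshot)      # anchor gradient for variance reduction
        y_sum = np.zeros_like(x0)
        for _ in range(m):
            # Inner iteration: Katyusha momentum couples z, snapshot, and y
            x = tau1 * z + tau2 * snapshot + (1.0 - tau1 - tau2) * y
            i = np.random.randint(n)
            # SVRG-style variance-reduced gradient estimator
            g = grad_i(x, i) - grad_i(snapshot, i) + mu
            y = x - eta * g           # gradient-descent-like sequence
            z = z - (eta / tau1) * g  # mirror-descent-like sequence
            y_sum += y
        x_tilde_prev, x_tilde = x_tilde, y_sum / m
    return x_tilde
```

The design point mirrored here is that the snapshot used for variance reduction is itself extrapolated across epochs, so the outer Nesterov momentum and the inner Katyusha momentum act at two different time scales.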

Summary

Introduction

We consider the following composite convex optimization problem associated with regularized empirical risk minimization (ERM), which is pervasive in machine learning [1]: minimize F(x) = (1/n) Σ_{i=1}^n fi(x) + r(x) over x ∈ Rd. Each term fi(x) measures the fitness between x and the data sample indexed by i, and the function r(x) acts as a regularizer on x to avoid over-fitting the data. We consider this smooth optimization problem in the large-scale setting, and seek an approximate minimizer x ∈ Rd such that F(x) − F(x∗) ≤ ε, where x∗ is the exact minimizer of F(x).
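
As a concrete instance of this objective, the sketch below writes F(x) as L2-regularized logistic regression; the choice of logistic loss for fi and an L2 regularizer for r is an assumption made only for illustration (the names make_logreg_problem, grad_i, and full_grad are ours), since the setting above covers general smooth fi and convex r.

```python
# A purely illustrative instance of F(x) = (1/n) sum_i fi(x) + r(x):
# fi is the logistic loss on one sample, r is an L2 regularizer.
import numpy as np

def make_logreg_problem(A, b, lam):
    """A: (n, d) data matrix, b: (n,) labels in {-1, +1}, lam: L2 weight."""
    n = A.shape[0]

    def F(x):
        # full objective: average logistic loss plus L2 regularization
        losses = np.log1p(np.exp(-b * (A @ x)))
        return losses.mean() + 0.5 * lam * x.dot(x)

    def grad_i(x, i):
        # gradient of fi(x) + r(x) for a single sample i
        s = -b[i] / (1.0 + np.exp(b[i] * A[i].dot(x)))
        return s * A[i] + lam * x

    def full_grad(x):
        # gradient of F(x), averaged over all n samples
        s = -b / (1.0 + np.exp(b * (A @ x)))
        return (A.T @ s) / n + lam * x

    return F, grad_i, full_grad
```

For this instance, the smoothness constant L required by a gradient-based solver can be bounded by max_i ‖ai‖²/4 + λ, where ai is the i-th row of A.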
