As a simple but effective way, the momentum method has been widely adopted in stochastic optimization algorithms for large-scale machine learning problems and the success of stochastic optimization with the momentum term for many applications in machine learning and other related areas has been reported everywhere. However, the understanding of how the momentum improves the performance of modern variance reduced stochastic gradient algorithms, e.g., the stochastic dual coordinate ascent average gradient (SDCA) method, the stochastically controlled stochastic gradient (SCSG) method, the stochastic recursive gradient algorithm (SARAH), etc., is still limited. To tackle this issue, this work studies the performance of SARAH with the momentum term theoretically and empirically, and develops a novel variance reduced stochastic gradient algorithm, termed as SARAH-M. We rigorously prove that SARAH-M attains a linear rate of convergence for minimizing the strongly convex function. We further propose an adaptive SARAH-M method (abbreviated as AdaSARAH-M) by incorporating the random Barzilai–Borwein (RBB) technique into SARAH-M, which provides an easy way to determine the step size for the original SARAH-M algorithm. The theoretical analysis that shows AdaSARAH-M with a linear convergence speed is also provided. Moreover, we show that the complexity of the proposed algorithms can outperform modern stochastic optimization algorithms. Finally, the numerical results, compared with state-of-the-art algorithms on benchmarking machine learning problems, verify the efficacy of the momentum in variance reduced stochastic gradient algorithms.
Read full abstract