Abstract

We develop a new family of variance reduced stochastic gradient descent methods for minimizing the average of a very large number of smooth functions. Our method—JacSketch—is motivated by novel developments in randomized numerical linear algebra, and operates by maintaining a stochastic estimate of a Jacobian matrix composed of the gradients of individual functions. In each iteration, JacSketch efficiently updates the Jacobian matrix by first obtaining a random linear measurement of the true Jacobian through (cheap) sketching, and then projecting the previous estimate onto the solution space of a linear matrix equation whose solutions are consistent with the measurement. The Jacobian estimate is then used to compute a variance-reduced unbiased estimator of the gradient. Our strategy is analogous to the way quasi-Newton methods maintain an estimate of the Hessian, and hence our method can be seen as a stochastic quasi-gradient method. Our method can also be seen as stochastic gradient descent applied to a controlled stochastic optimization reformulation of the original problem, where the control comes from the Jacobian estimates. We prove that for smooth and strongly convex functions, JacSketch converges linearly with a meaningful rate dictated by a single convergence theorem which applies to general sketches. We also provide a refined convergence theorem which applies to a smaller class of sketches, featuring a novel proof technique based on a stochastic Lyapunov function. This enables us to obtain sharper complexity results for variants of JacSketch with importance sampling. By specializing our general approach to specific sketching strategies, JacSketch reduces to the celebrated stochastic average gradient (SAGA) method, and its several existing and many new minibatch, reduced memory, and importance sampling variants. Our rate for SAGA with importance sampling is the current best-known rate for this method, resolving a conjecture by Schmidt et al. (Proceedings of the eighteenth international conference on artificial intelligence and statistics, AISTATS 2015, San Diego, California, 2015). The rates we obtain for minibatch SAGA are also superior to existing rates and are sufficiently tight as to show a decrease in total complexity as the minibatch size increases. Moreover, we obtain the first minibatch SAGA method with importance sampling.
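
To make the update described in the abstract concrete, below is a minimal Python sketch of the SAGA special case of JacSketch (unit coordinate sketches, uniform sampling). The function name `jacsketch_saga` and the argument `grad_i` are illustrative, not taken from the paper; this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def jacsketch_saga(grad_i, x0, n, alpha, num_iters, rng=None):
    """Minimal JacSketch sketch with unit coordinate (SAGA-style) sketches.

    grad_i(x, i) returns the gradient of the i-th function f_i at x.
    The Jacobian estimate J stores one column per function; each step
    measures one true column and projects the previous estimate onto
    the measurement, which here amounts to overwriting that column.
    """
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    d = x0.shape[0]
    J = np.zeros((d, n))             # Jacobian estimate: column i approximates grad f_i
    J_mean = J.mean(axis=1)          # running average of the columns
    for _ in range(num_iters):
        i = rng.integers(n)                  # uniform sampling (importance sampling also possible)
        g_new = grad_i(x, i)                 # cheap sketch: one true column of the Jacobian
        g = g_new - J[:, i] + J_mean         # unbiased, variance-reduced gradient estimate
        J_mean += (g_new - J[:, i]) / n      # keep the column average current
        J[:, i] = g_new                      # projection step: replace the measured column
        x -= alpha * g                       # gradient-type step
    return x
```

With a unit coordinate sketch the projection step reduces to overwriting the measured column of the Jacobian estimate, which is how the general sketch-and-project update collapses to the familiar SAGA recursion.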

Highlights

  • We consider the problem of minimizing the average of a large number of differentiable functions, $x^* = \arg\min_{x \in \mathbb{R}^d} f(x) \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^n f_i(x)$ (1), where $f$ is $\mu$-strongly convex and $L$-smooth (both assumptions are written out after this list)

  • Stochastic gradient descent (SGD) scales well in the number of data samples, which is important in many machine learning applications, where the number of data samples can be very large

  • The variance of the stochastic estimates of the gradient produced by SGD does not vanish during the iterative process, which suggests that a decreasing stepsize regime needs to be put into place if SGD is to converge
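
For reference, the two assumptions in the first highlight can be written as follows (a standard statement of strong convexity and smoothness, not quoted from the paper): for all $x, y \in \mathbb{R}^d$,

$$
f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2,
\qquad
f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2 .
$$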


Summary

Introduction

In solving (1), we restrict our attention to first-order methods that use a (variance-reduced) stochastic estimate of the gradient $g^k \approx \nabla f(x^k)$ to take a step towards minimizing (1) by iterating $x^{k+1} = x^k - \alpha g^k$. The need for incremental methods for the training phase of machine learning models has revived interest in the stochastic gradient descent (SGD) method [27]. The variance of the stochastic estimates of the gradient produced by SGD does not vanish during the iterative process, which suggests that a decreasing stepsize regime needs to be put into place if SGD is to converge. For SGD to work efficiently, this decreasing stepsize regime needs to be tuned for each application area, which is costly.
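
To make the stepsize remark concrete, here is a minimal plain-SGD loop with a $1/(k+1)$ decay; the schedule and the names `sgd_decreasing_step` and `grad_i` are illustrative and not taken from the paper.

```python
import numpy as np

def sgd_decreasing_step(grad_i, x0, n, alpha0, num_iters, rng=None):
    """Plain SGD: without variance reduction the stepsize must decay
    (here alpha_k = alpha0 / (1 + k)) for the iterates to converge."""
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    for k in range(num_iters):
        i = rng.integers(n)           # sample one function uniformly
        alpha_k = alpha0 / (1 + k)    # decreasing stepsize schedule
        x -= alpha_k * grad_i(x, i)   # step along a noisy gradient estimate
    return x
```

Variance-reduced methods such as JacSketch avoid this decay: because the variance of $g^k$ vanishes as the iterates approach the optimum, a constant stepsize suffices for linear convergence on smooth, strongly convex problems.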

Variance-reduced methods
Gaps in our understanding of SAGA
Jacobian sketching: a new approach to variance reduction
SAGA as a special case of JacSketch
Summary of complexity results
Outline of the paper
Notation
Controlled stochastic reformulations
Stochastic reformulation using sketching
The controlled stochastic reformulation
JacSketch algorithm
A window into biased estimates and SAG
Convergence analysis for general sketches
Two expected smoothness constants
Stochastic contraction number
Convergence theorem
Projection lemmas and the stochastic contraction number ρ
Key lemmas
Proof of Theorem 1
Minibatch sketches
Samplings
Minibatch sketches and projections
Expected smoothness constants L1 and L2
Estimating the sketch residual
Calculating the iteration complexity for special cases
Comparison with previous mini-batch SAGA convergence results
A refined analysis with a stochastic Lyapunov function
Gradient estimate contraction
$J^k e/n$
Proof of Theorem 6
Calculating the iteration complexity in special cases
Experiments
New non-uniform sampling using optimal probabilities
Optimal mini-batch size
Comparative experiments
Findings
Conclusion
