Convergence analysis of distributed stochastic gradient descent with shuffling

Qi Meng, Wei Chen, Yue Wang, Zhi-Ming Ma, Tie-Yan Liu

doi:10.1016/j.neucom.2019.01.037

Abstract

When stochastic gradient descent (SGD) is used to solve large-scale machine learning problems, especially deep learning problems, a common data-processing practice is to shuffle the training data, partition it across multiple threads or machines if needed, and then perform several epochs of training on the re-shuffled (either locally or globally) data. Under this procedure, the instances used to compute the gradients are no longer independently sampled from the training set, which contradicts the basic assumptions of conventional convergence analysis of SGD. Does distributed SGD still have desirable convergence properties in this practical setting? In this paper, we answer this question. First, we give a mathematical formulation of the practical data-processing procedure in distributed machine learning, which we call (data partition with) global/local shuffling. We observe that global shuffling is equivalent to without-replacement sampling if the shuffling operations are independent. Second, we prove that SGD with global shuffling and with local shuffling has convergence guarantees for non-convex tasks such as deep learning. The convergence rate for local shuffling is slower than that for global shuffling, since some information is lost when there is no communication between the partitioned data. We also consider the situation in which the permutation after shuffling is not uniformly distributed (we call this insufficient shuffling), and discuss the condition under which this insufficiency does not affect the convergence rate. Finally, we give the convergence analysis for the convex case. An interesting finding is that non-convex tasks such as deep learning are better suited to shuffling than convex tasks. Our theoretical results provide important insights for large-scale machine learning, especially for selecting data-processing methods that achieve faster convergence and good speedup. Our theoretical findings are verified by extensive experiments on logistic regression and deep neural networks.
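The difference between the two data-processing schemes can be illustrated with a minimal Python sketch. This is only one plausible reading of the abstract, not the paper's formal definition: the function names `global_shuffling` and `local_shuffling`, the equal-sized contiguous partition, and the assumption that the number of workers divides the data size are all hypothetical choices made for illustration.

```python
import random


def global_shuffling(n, workers, epochs, seed=0):
    """Reshuffle all n indices every epoch, then split them across workers.

    Assumption: `workers` divides `n`, so every worker gets an equal share.
    """
    rng = random.Random(seed)
    size = n // workers
    schedule = []
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)  # a fresh permutation of the whole data set
        schedule.append([perm[w * size:(w + 1) * size] for w in range(workers)])
    return schedule


def local_shuffling(n, workers, epochs, seed=0):
    """Shuffle and partition once; each worker then reshuffles only its own shard."""
    rng = random.Random(seed)
    size = n // workers
    perm = list(range(n))
    rng.shuffle(perm)  # single initial shuffle before partitioning
    shards = [perm[w * size:(w + 1) * size] for w in range(workers)]
    schedule = []
    for _ in range(epochs):
        epoch = []
        for shard in shards:
            order = shard[:]    # copy so the fixed shard is not mutated
            rng.shuffle(order)  # reorder within the worker only
            epoch.append(order)
        schedule.append(epoch)
    return schedule


if __name__ == "__main__":
    for name, fn in [("global", global_shuffling), ("local", local_shuffling)]:
        print(name)
        for epoch in fn(n=8, workers=2, epochs=2):
            print("  ", epoch)
```

In both schemes every index is visited exactly once per epoch, which matches the without-replacement flavor described above. Under the local variant a worker sees the same set of instances in every epoch, only in a different order, which is one way to picture the missing communication between partitions.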

Journal: Neurocomputing
Publication Date: Jan 22, 2019
Citations: 85
