Abstract

Stochastic Gradient Descent (SGD) is a popular optimization method used to train a variety of machine learning models. Most work on SGD to date has concentrated on improving its statistical efficiency, i.e., its rate of convergence to the optimal solution. At the same time, as the parallelism of modern CPUs continues to increase through progressively higher core counts, it is imperative to understand the hardware efficiency of parallel SGD, which is often at odds with its statistical efficiency. In this paper, we explore several modern parallelization methods for SGD on a shared-memory system, in the context of sparse, convex optimization problems. Specifically, we develop optimized parallel implementations of several SGD algorithms and show that their parallel efficiency is severely limited by inter-core communication. We propose a new, scalable, communication-avoiding, many-core-friendly implementation of SGD, called HogBatch, which exposes parallelism on several levels, minimizes the impact on statistical efficiency, and, as a result, significantly outperforms the other methods. On a variety of datasets, HogBatch demonstrates near-linear scalability on a system with 14 cores and delivers up to a 20X speedup over previous methods.
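
The paper's HogBatch implementation is not reproduced here; the following is a minimal illustrative sketch (an assumption, not the authors' code) of the communication-avoiding idea the abstract describes: each thread accumulates gradients over a thread-private minibatch and writes to the shared model only once per batch, rather than once per sample as in fully asynchronous (Hogwild-style) SGD. The logistic-regression objective, dense synthetic data, OpenMP, and all parameter values below are assumptions chosen for illustration.

// Minimal sketch (assumed, not the paper's code): communication-avoiding
// minibatch SGD for logistic regression on a shared-memory machine.
// Each thread accumulates a gradient over a private minibatch and writes
// to the shared model once per batch, reducing inter-core cache traffic
// relative to per-sample asynchronous updates.
// Build: g++ -O2 -fopenmp sketch.cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int n = 10000, d = 64, batch = 256, epochs = 5;
    const double lr = 0.1;  // illustrative learning rate, not from the paper

    // Synthetic dense data; the paper targets sparse problems, but dense
    // data keeps the sketch short.
    std::mt19937 rng(42);
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::vector<double> X(n * d), w_true(d);
    std::vector<int> y(n);
    for (auto& v : w_true) v = gauss(rng);
    for (int i = 0; i < n; ++i) {
        double dot = 0.0;
        for (int j = 0; j < d; ++j) {
            X[i * d + j] = gauss(rng);
            dot += X[i * d + j] * w_true[j];
        }
        y[i] = dot > 0 ? 1 : 0;
    }

    std::vector<double> w(d, 0.0);  // shared model vector
    for (int e = 0; e < epochs; ++e) {
        #pragma omp parallel
        {
            std::vector<double> g(d);  // thread-private gradient buffer
            #pragma omp for schedule(static)
            for (int b = 0; b < n / batch; ++b) {
                std::fill(g.begin(), g.end(), 0.0);
                for (int i = b * batch; i < (b + 1) * batch; ++i) {
                    double dot = 0.0;
                    for (int j = 0; j < d; ++j) dot += X[i * d + j] * w[j];
                    double p = 1.0 / (1.0 + std::exp(-dot));  // sigmoid
                    double err = p - y[i];                    // dLoss/dDot
                    for (int j = 0; j < d; ++j) g[j] += err * X[i * d + j];
                }
                // One shared-model update per minibatch. Hogwild-style SGD
                // would even drop the atomics and tolerate the races.
                for (int j = 0; j < d; ++j) {
                    #pragma omp atomic
                    w[j] -= lr * g[j] / batch;
                }
            }
        }
    }
    std::printf("w[0] = %.3f (true %.3f)\n", w[0], w_true[0]);
    return 0;
}

The key design point in this sketch is the per-batch update frequency: the shared vector w is read freely but written only n/batch times per epoch per thread, which is what limits cache-line ping-pong between cores.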
