Horizontal or Vertical?

Jinkun Geng, Shuai Wang, Dan Li

doi:10.1145/3322795.3331461

Abstract

Data parallelism and model parallelism are two typical parallel modes for distributed machine learning (DML). Traditionally, DML has relied mainly on data parallelism, which maintains one model instance on each node and synchronizes the model parameters at the end of every iteration. However, as models grow larger, communication cost and GPU memory consumption become significant. Data parallelism thus fails to work efficiently at large scale, and model-parallel solutions have been proposed in recent years. In this paper, we comprehensively discuss the benefits and drawbacks of both approaches. Based on this comparative analysis, we propose Hove, a hybrid approach that incorporates data parallelism and model parallelism to balance the overheads and achieve high performance for large-scale DML.
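
To make the two modes concrete, the following is a minimal sketch on a toy linear model, written in plain Python. It is an illustration under stated assumptions, not the Hove algorithm: the function names (`grad`, `data_parallel_step`, `model_parallel_forward`), the squared loss, and the toy data are all hypothetical. In the data-parallel step every worker keeps a full copy of the parameters and the gradients are averaged once per iteration; in the model-parallel forward pass each worker keeps only a slice of the parameters and the partial outputs are combined.

```python
# Toy comparison of data parallelism and model parallelism on a linear model.
# All names below are hypothetical and chosen for illustration only.

def predict(w, x):
    """Dot product of a parameter vector and a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def grad(w, batch):
    """Mean gradient of 0.5 * (w.x - y)^2 over a batch of (x, y) pairs."""
    g = [0.0] * len(w)
    for x, y in batch:
        err = predict(w, x) - y
        for i, xi in enumerate(x):
            g[i] += err * xi / len(batch)
    return g

def data_parallel_step(replicas, shards, lr):
    """Each worker holds a full model copy and its own data shard."""
    local = [grad(w, shard) for w, shard in zip(replicas, shards)]
    # Synchronization point: average the gradients across workers.
    avg = [sum(col) / len(local) for col in zip(*local)]
    return [[wi - lr * gi for wi, gi in zip(w, avg)] for w in replicas]

def model_parallel_forward(w_slices, x_slices):
    """Each worker holds only a slice of the parameters."""
    partial = [predict(w, x) for w, x in zip(w_slices, x_slices)]
    # Partial outputs are exchanged between workers and summed.
    return sum(partial)

data = [([1.0, 2.0, 0.0, 1.0], 3.0), ([0.0, 1.0, 1.0, 2.0], 1.0),
        ([2.0, 0.0, 1.0, 0.0], 2.0), ([1.0, 1.0, 1.0, 1.0], 4.0)]
w0 = [0.5, -0.5, 1.0, 0.0]

# Data parallelism: two workers, two equal-sized shards, one synchronized step.
replicas = data_parallel_step([list(w0), list(w0)], [data[:2], data[2:]], lr=0.1)
# Reference: the same step computed on a single node over the full batch.
single = [wi - 0.1 * gi for wi, gi in zip(w0, grad(w0, data))]

# Model parallelism: the four parameters are split across two workers.
x = data[0][0]
split_out = model_parallel_forward([w0[:2], w0[2:]], [x[:2], x[2:]])
```

With equal-sized shards, the averaged data-parallel update matches the single-node full-batch update up to floating-point rounding, at the cost of every worker storing all parameters and exchanging a full gradient each iteration. The model-parallel split stores only part of the parameters per worker but has to communicate intermediate results instead, which is the kind of trade-off the abstract refers to.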

Full Text