Abstract

Parallel computation is essential for machine learning systems that handle large amounts of training data. One popular form of parallel machine learning is data-parallel training, in which a large number of computational workers are coordinated through parameter servers. Fault tolerance is a crucial issue for large-scale computation systems in general, and parameter server based machine learning systems are no exception. However, while fault tolerance has been widely discussed for large-scale computation systems in general, parallel machine learning systems have received little attention, despite their unique characteristics. In this paper, we discuss the fault tolerance of parallel machine learning systems that use parameter servers, where the parameter servers provide extra redundancy to the system and can double as checkpoint servers. We also quantitatively evaluate several fault tolerance methods using the parallel environment simulator SimGrid, and demonstrate the effectiveness of the proposed method.
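
As a rough illustration of the idea described above, the following minimal Python sketch shows a parameter server that also acts as the checkpoint server: workers pull the current parameters and push gradient updates, and the server periodically persists its own state so training can resume after a failure. The class and function names (CheckpointingParameterServer, push, pull, checkpoint, restore) and the toy gradient computation are illustrative assumptions, not the paper's actual implementation or its SimGrid evaluation setup.

    # Minimal sketch (illustrative only, not the paper's implementation) of a
    # data-parallel setup in which the parameter server doubles as the
    # checkpoint server: it applies workers' gradient updates and periodically
    # persists its own state so that training can resume after a failure.
    import pickle
    import random

    class CheckpointingParameterServer:
        def __init__(self, num_params, checkpoint_path, checkpoint_every):
            self.params = [0.0] * num_params      # parameters held by the server
            self.step = 0
            self.checkpoint_path = checkpoint_path
            self.checkpoint_every = checkpoint_every

        def pull(self):
            """Return a copy of the current parameters to a worker."""
            return list(self.params)

        def push(self, gradients, lr=0.1):
            """Apply a worker's gradient update, then checkpoint if due."""
            self.params = [p - lr * g for p, g in zip(self.params, gradients)]
            self.step += 1
            if self.step % self.checkpoint_every == 0:
                self.checkpoint()

        def checkpoint(self):
            # The server doubles as the checkpoint server: it persists its state.
            with open(self.checkpoint_path, "wb") as f:
                pickle.dump({"step": self.step, "params": self.params}, f)

        def restore(self):
            # On recovery, reload the most recent checkpoint.
            with open(self.checkpoint_path, "rb") as f:
                state = pickle.load(f)
            self.step, self.params = state["step"], state["params"]


    def worker(server, data_shard):
        """One data-parallel worker: pull parameters, compute a toy gradient, push it."""
        params = server.pull()
        # Gradient of a toy squared-error loss pulling the parameters toward the shard.
        gradients = [p - x for p, x in zip(params, data_shard)]
        server.push(gradients)


    if __name__ == "__main__":
        server = CheckpointingParameterServer(num_params=4,
                                              checkpoint_path="ps_checkpoint.pkl",
                                              checkpoint_every=10)
        shards = [[random.random() for _ in range(4)] for _ in range(100)]
        for shard in shards:          # simulate workers pushing updates in turn
            worker(server, shard)
        server.restore()              # after a simulated failure, recover state
        print("restored at step", server.step)

Because the parameter server already holds the authoritative copy of the model, checkpointing from the server avoids gathering state from every worker on failure, which appears to be the redundancy argument the abstract alludes to.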
