Abstract

As the volume of training data grows, traditional single-machine learning can no longer train models efficiently, so distributed machine learning is increasingly used for large-scale training. Most commonly used distributed machine learning algorithms are based on data parallelism and typically adopt a bulk synchronous parallel (BSP) strategy when exchanging data, but under this strategy the overall training speed is limited by the slowest workers in the cluster. An asynchronous parallel strategy, by contrast, makes full use of the cluster's computing capacity, but the resulting delay in updating the global model parameters can cause large errors or even prevent the model from converging. In this paper, the author combines the two approaches by grouping the workers: synchronous parallelism is used among the workers within a group, while the groups train asynchronously with respect to one another. Experiments show that this hybrid parallelism strategy reduces training time while preserving correctness.
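The sketch below illustrates the grouped hybrid strategy the abstract describes, not the authors' actual system: within a group, the workers' gradients are averaged each step (synchronous), while each group pushes its averaged update to a shared parameter server as soon as it is ready (asynchronous across groups). The `ParameterServer` class, group sizes, learning rate, and least-squares objective are all illustrative assumptions.

```python
# Minimal simulation of hybrid (intra-group synchronous, inter-group asynchronous)
# data-parallel training. Illustrative only; not the paper's implementation.
import threading
import numpy as np


class ParameterServer:
    """Holds the global model; groups push averaged gradients asynchronously."""

    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr
        self._lock = threading.Lock()

    def pull(self):
        with self._lock:
            return self.w.copy()

    def push(self, grad):
        # Asynchronous update: applied as soon as a group finishes its step,
        # without waiting for the other groups (so some staleness is possible).
        with self._lock:
            self.w -= self.lr * grad


def worker_gradient(w, x, y):
    # Least-squares gradient on one worker's data shard: X^T (Xw - y) / n.
    return x.T @ (x @ w - y) / len(y)


def run_group(ps, shards, steps):
    """One group of workers: gradients inside the group are averaged every step
    (synchronous within the group), then pushed to the server (asynchronous
    relative to the other groups)."""
    for _ in range(steps):
        w = ps.pull()
        grads = [worker_gradient(w, x, y) for x, y in shards]  # in-group sync step
        ps.push(np.mean(grads, axis=0))                        # async push across groups


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = rng.normal(size=5)

    def make_shard(n=64):
        x = rng.normal(size=(n, 5))
        return x, x @ true_w + 0.01 * rng.normal(size=n)

    ps = ParameterServer(dim=5)
    # Two groups of three workers each; each group runs in its own thread,
    # so the groups' pushes interleave asynchronously.
    groups = [[make_shard() for _ in range(3)] for _ in range(2)]
    threads = [threading.Thread(target=run_group, args=(ps, g, 200)) for g in groups]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("parameter error:", np.linalg.norm(ps.w - true_w))
```

In this setup the slowest worker only delays its own group rather than the whole cluster, while the per-group averaging keeps each pushed update less noisy than a single worker's gradient, which is the trade-off the hybrid strategy aims for.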
