Abstract

As the volume of training data grows, traditional single-machine learning can no longer train models efficiently, so distributed machine learning is increasingly used for large-scale training. Most commonly used distributed machine learning algorithms are based on data parallelism and typically adopt a bulk synchronous parallel strategy when exchanging updates, but under this strategy the overall training speed is limited by the computation speed of the slowest workers in the cluster. An asynchronous parallel strategy, by contrast, makes full use of the cluster's computational speed, but the global model parameters are then updated with a delay (staleness), which can lead to excessive error or prevent the model from converging. In this paper, the author combines the two approaches by grouping the workers: workers within a group train with synchronous parallelism, while the groups themselves update the global model asynchronously. Experiments show that this hybrid parallelism strategy reduces training time while preserving correctness.
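To make the grouped hybrid strategy concrete, below is a minimal sketch assuming a parameter-server style setup, which the abstract does not specify. All names (ParameterServer, run_group, worker_gradient) and the toy linear-regression workload are illustrative assumptions, not the paper's implementation: each group averages its workers' gradients synchronously, and the groups push their averaged updates to the shared parameters asynchronously.

```python
# Hedged sketch of hybrid parallelism: synchronous within a group,
# asynchronous across groups. Toy example; not the paper's code.
import threading
import numpy as np


class ParameterServer:
    """Holds the global model; groups push updates asynchronously."""

    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):
        # Asynchronous update: applied as soon as any group reports in.
        with self.lock:
            self.w -= self.lr * grad


def worker_gradient(w, X, y):
    # Mean-squared-error gradient for a linear model on one worker's shard.
    return X.T @ (X @ w - y) / len(y)


def run_group(server, shards, steps):
    """One group: all workers compute gradients on the same (possibly stale)
    pull of the global model, synchronize by averaging, then push one update."""
    for _ in range(steps):
        w = server.pull()                                        # may be stale w.r.t. other groups
        grads = [worker_gradient(w, X, y) for X, y in shards]    # synchronous within the group
        server.push(np.mean(grads, axis=0))                      # asynchronous across groups


def make_shard(rng, true_w, n=64):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + 0.01 * rng.normal(size=n)
    return X, y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -3.0])
    shards = [make_shard(rng, true_w) for _ in range(8)]          # 8 workers' data shards
    server = ParameterServer(dim=2)
    groups = [shards[:4], shards[4:]]                             # 2 groups of 4 workers each
    threads = [threading.Thread(target=run_group, args=(server, g, 200)) for g in groups]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("learned weights:", server.pull())                      # should approach [2, -3]
```

In this sketch the slowest worker only delays its own group, not the whole cluster, while the staleness of the global parameters is bounded by the number of groups rather than the number of workers, which is the trade-off the abstract describes.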

