Abstract
Graph embedding training models access parameters sparsely in a “one-hot” manner. Currently, the distributed graph embedding neural network is learned by data parallel with the parameter server, which suffers significant performance and scalability problems. In this article, we analyze the problems and characteristics of training this kind of models on distributed GPU clusters for the first time, and find that fixed model parameters scattered among different machine nodes are a major limiting factor for efficiency. Based on our observation, we develop an efficient distributed graph embedding system called EDGES, which can utilize GPU clusters to train large graph models with billions of nodes and trillions of edges using data and model parallelism. Within the system, we propose a novel dynamic partition architecture for training these models, achieving at least one half of communication reduction compared to existing training systems. According to our evaluations on real-world networks, our system delivers a competitive accuracy for the trained embeddings, and significantly accelerates the training process of the graph node embedding neural network, achieving a speedup of 7.23x and 18.6x over the existing fastest training system on single node and multi-node, respectively. As for the scalability, our experiments show that EDGES obtains a nearly linear speedup.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: IEEE Transactions on Parallel and Distributed Systems
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.