Abstract
With the increasing complexity of the computational tasks faced by artificial intelligence technology, the scale of machine learning models continues to expand, and the volume and frequency of parameter synchronization also increase. As a result, the communication bandwidth within the GPU cluster becomes the main bottleneck of distributed model training. Many existing solutions cannot be widely adopted because they require specialized equipment, are costly, and are difficult to use. To solve this problem, this paper proposes a multi-network-card bypass parallel communication mechanism based on Intel DPDK technology, which increases the bandwidth within the GPU cluster at a lower cost and makes full use of the idle CPU resources of the GPU server to accelerate data transmission. First, we propose a data transmission model based on multiple network cards and design a port load balancing algorithm to keep the load balanced across the network cards. Second, a CPU multi-core scheduling model and algorithm are implemented to reduce CPU energy consumption, resource occupation, and the impact on other applications. Furthermore, for multiple application scenarios, a rate adjustment model and algorithm are designed and implemented to ensure fair use of bandwidth among applications. Finally, the experimental results show that this mechanism can provide high bandwidth for GPU clusters equipped with inexpensive multiple network cards and can aggregate the bandwidth of multiple network cards within a single connection; it offers high reliability and transmission efficiency, is simple to use, and is flexible to extend.
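To make the port load balancing idea concrete, the following is a minimal, hypothetical sketch (not the paper's algorithm): each chunk of a parameter tensor is assigned to the NIC port with the fewest outstanding bytes, keeping the per-port load roughly even. The names, the number of ports, and the chunk size are assumptions for illustration only.

```c
/* Hypothetical least-loaded port selection (illustration only, not BPCM's
 * actual load balancing algorithm). */
#include <stdint.h>
#include <stdio.h>

#define NB_PORTS 4

static uint64_t queued_bytes[NB_PORTS];   /* bytes enqueued but not yet sent per port */

/* Return the index of the least-loaded port and charge it with `len` bytes. */
static int
pick_port(uint32_t len)
{
    int best = 0;
    for (int p = 1; p < NB_PORTS; p++)
        if (queued_bytes[p] < queued_bytes[best])
            best = p;
    queued_bytes[best] += len;
    return best;
}

int
main(void)
{
    /* Spread eight 64 KiB chunks of a parameter tensor across the ports. */
    for (int i = 0; i < 8; i++)
        printf("chunk %d -> port %d\n", i, pick_port(64 * 1024));
    return 0;
}
```

A least-loaded policy is only one possible choice; round-robin or hash-based assignment would also spread traffic, but tracking queued bytes keeps ports balanced when chunk sizes vary.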
Highlights
In recent years, artificial intelligence technology has achieved unprecedented breakthroughs in many application fields, relying on powerful computing power and massive training data.
The BPCM system is developed on DPDK version 17.11.3 (a minimal DPDK usage sketch follows these highlights).
As the hardware platform of a distributed machine learning system, the GPU cluster plays a decisive role in the speed of machine learning model training.
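As context for the DPDK 17.11.3 highlight above, here is a minimal sketch, under stated assumptions, of how one transmit worker per idle CPU core could be launched with the DPDK 17.11-era EAL API. This is not the BPCM implementation: the worker body is a placeholder, port/queue setup is omitted, and the round-robin port assignment is an assumption for illustration.

```c
/* Minimal sketch (not BPCM itself): launch one transmit worker per DPDK
 * slave lcore, each bound to its own NIC port, illustrating how idle CPU
 * cores could drive multiple inexpensive NICs in parallel.
 * Assumes DPDK 17.11-era APIs; port/queue setup is omitted. */
#include <stdio.h>
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_ethdev.h>
#include <rte_debug.h>

static int
tx_worker(void *arg)
{
    uint16_t port_id = *(uint16_t *)arg;
    printf("lcore %u driving port %u\n", rte_lcore_id(), port_id);
    /* A real worker would build rte_mbuf bursts and call
     * rte_eth_tx_burst(port_id, 0, pkts, n) in a loop. */
    return 0;
}

int
main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");

    /* static so the addresses stay valid while workers run */
    static uint16_t port_ids[RTE_MAX_LCORE];
    uint16_t nb_ports = rte_eth_dev_count();   /* DPDK 17.11 API */
    uint16_t port = 0;
    unsigned lcore_id;

    /* Assign NIC ports to slave lcores round-robin; the master core stays
     * free so other applications on the GPU server are not starved. */
    RTE_LCORE_FOREACH_SLAVE(lcore_id) {
        if (port >= nb_ports)
            break;
        port_ids[lcore_id] = port++;
        rte_eal_remote_launch(tx_worker, &port_ids[lcore_id], lcore_id);
    }
    rte_eal_mp_wait_lcore();
    return 0;
}
```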
Summary
Artificial intelligence technology has achieved unprecedented breakthroughs in many application fields, relying on powerful computing power and massive training data. The currently popular single-machine multi-GPU server has a limited number of GPUs and is too costly, so it cannot perform well in larger-scale model training. In a distributed machine learning system, the communication time among GPU nodes easily becomes the bottleneck of the entire distributed training task [9]. To solve this problem, researchers have proposed many excellent solutions in terms of reducing the data transmission volume, increasing communication bandwidth, improving network communication protocols, and optimizing the underlying physical topology. In the distributed training process, a well-designed parameter synchronization architecture or algorithm can reduce the amount of communication data and the delay. Parameter Hub proposed PBox, which equips a centralized parameter server (PS) with multiple network cards to match IO performance with memory bandwidth [17].