Abstract

Training models on large-scale GPU-accelerated clusters is becoming commonplace due to the increasing complexity and size of deep learning models. One of the main challenges in distributed training is the collective communication overhead for large message sizes, up to hundreds of MB. In this paper, we propose two hierarchical distributed-memory multileader AllReduce algorithms optimized for GPU-accelerated clusters, named lr_lr and lr_rab, in which the GPUs inside a compute node perform an intra-node communication phase to gather and store locally reduced values on designated GPUs (the node leaders). The node leaders then act as inter-node communicators: each leader exchanges one part of the reduced values with the leaders of the other nodes in parallel. Hence, we significantly reduce the time needed to inject data into the inter-node network. We also overlap the inter-node and intra-node communication by implementing our proposal in a pipelined manner. We evaluate these algorithms with the discrete-event simulator SimGrid. We show that our algorithms, lr_lr and lr_rab, cut down the execution time of an AllReduce microbenchmark that uses the logical ring algorithm (lr) by up to 45% and 51%, respectively. With the pipelined implementation, lr_lr_pipe achieves a 15% performance improvement over lr_lr. In addition, the simulation results project power savings for the network devices of up to 23% and 32%.
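
The following is a minimal sketch of the two-phase hierarchical AllReduce idea described above, written with mpi4py and NumPy. It is an illustrative assumption of one possible structure, not the authors' lr_lr or lr_rab implementation: here every GPU on a node acts as the leader for one chunk of the buffer, and the inter-node reduction among leaders is left to MPI's built-in Allreduce rather than an explicit logical ring or Rabenseifner schedule.

# Hypothetical sketch of a multileader hierarchical AllReduce (not the paper's code).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
# Intra-node communicator: all ranks (GPUs) placed on the same compute node.
node_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
# Leader communicator: ranks with the same local index across nodes, so each
# local rank serves as the leader for one chunk of the buffer.
leader_comm = comm.Split(color=node_comm.Get_rank(), key=comm.Get_rank())

def multileader_allreduce(sendbuf):
    """Hierarchical AllReduce: intra-node reduce-scatter to the leaders,
    inter-node AllReduce among leaders, then intra-node allgather.
    Assumes len(sendbuf) is divisible by the number of GPUs per node."""
    local_size = node_comm.Get_size()
    chunk = np.empty(len(sendbuf) // local_size, dtype=sendbuf.dtype)
    # Phase 1 (intra-node): each GPU ends up with the node-local sum of one chunk.
    node_comm.Reduce_scatter_block(sendbuf, chunk, op=MPI.SUM)
    # Phase 2 (inter-node): leaders holding the same chunk index on every node
    # reduce their chunks in parallel across the cluster.
    leader_comm.Allreduce(MPI.IN_PLACE, chunk, op=MPI.SUM)
    # Phase 3 (intra-node): gather all reduced chunks so every GPU has the result.
    recvbuf = np.empty_like(sendbuf)
    node_comm.Allgather(chunk, recvbuf)
    return recvbuf

if __name__ == "__main__":
    data = np.ones(1 << 20, dtype=np.float32)  # 4 MB per rank, for example
    result = multileader_allreduce(data)
    assert np.allclose(result, comm.Get_size())

Because each leader injects only its own chunk into the inter-node network, the per-link message size shrinks by a factor equal to the number of leaders per node, which is the effect the lr_lr and lr_rab algorithms exploit.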
