Abstract

Distributed deep training has become a significant consumer of bandwidth across datacenter-scale networks. The diverse parallelization strategies employed in deep training require different communication patterns, necessitating periodic adaptation of the network topology. Electrical switching is approaching its capacity limit at high bandwidths and offers little room for topology adaptation (its logical and physical topologies are isomorphic), so optical switching has become an attractive option for addressing these bottlenecks. In this paper, we propose Modoru, a wavelength- and datarate-agnostic Clos architecture with a switching speed of O(10 ns). Modoru is a drop-in replacement that places no constraints on achieving a high radix. To verify its topological flexibility, we also develop topology-as-a-service, which provisions a sequence of dynamic topologies for training jobs and guarantees high topology availability across the entire network. Large-scale simulations show a 7.9× acceleration of deep training jobs under Modoru. Experiments on a Modoru prototype further demonstrate that provisioning adaptive topologies accelerates deep training jobs.
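
To make the topology-as-a-service idea concrete, here is a minimal Python sketch of how a controller might translate per-job logical topology requests (e.g., a ring for data-parallel all-reduce, all-to-all for expert parallelism) into circuit configurations on a reconfigurable optical fabric. All names here (Job, OpticalClos, circuits_for, provision) are illustrative assumptions for exposition, not Modoru's actual control plane or API.

# Hypothetical topology-as-a-service control loop (illustrative only,
# not Modoru's real API).
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Job:
    name: str
    nodes: list[int]   # accelerator IDs assigned to this job
    topology: str      # logical topology requested for the next phase


class OpticalClos:
    """Toy model of a reconfigurable optical fabric: it only tracks the
    set of point-to-point circuits currently established."""

    def __init__(self) -> None:
        self.circuits: set[tuple[int, int]] = set()

    def reconfigure(self, circuits: set[tuple[int, int]]) -> None:
        # A real fabric would retune wavelengths across the Clos stages;
        # the toy model just swaps the circuit set atomically.
        self.circuits = circuits


def circuits_for(job: Job) -> set[tuple[int, int]]:
    """Translate a logical topology request into physical circuits."""
    n = job.nodes
    if job.topology == "ring":        # e.g., ring all-reduce (data parallelism)
        return {(n[i], n[(i + 1) % len(n)]) for i in range(len(n))}
    if job.topology == "all-to-all":  # e.g., expert parallelism
        return set(combinations(n, 2))
    raise ValueError(f"unknown topology {job.topology!r}")


def provision(fabric: OpticalClos, jobs: list[Job]) -> None:
    """Reconfigure the fabric with the union of all jobs' circuits;
    jobs are assumed to run on disjoint sets of nodes."""
    wanted: set[tuple[int, int]] = set()
    for job in jobs:
        wanted |= circuits_for(job)
    fabric.reconfigure(wanted)


if __name__ == "__main__":
    fabric = OpticalClos()
    jobs = [
        Job("dp-job", nodes=[0, 1, 2, 3], topology="ring"),
        Job("moe-job", nodes=[4, 5, 6, 7], topology="all-to-all"),
    ]
    provision(fabric, jobs)
    print(f"{len(fabric.circuits)} circuits established")

In a real deployment, the reconfigure step would map the requested circuits onto wavelength assignments across the Clos stages; the toy model swaps the circuit set in one step only to mirror the atomic, nanosecond-scale reconfiguration the paper targets.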
