Abstract

The scaling trends of deep learning models and distributed training workloads are challenging network capacities in today’s datacenters and high-performance computing (HPC) systems. We propose a system architecture that leverages silicon photonic (SiP) switch-enabled server regrouping using bandwidth steering to tackle these challenges and accelerate distributed deep learning training. In addition, our proposed architecture uses a highly integrated, operating system-based SiP switch control scheme to reduce implementation complexity. To demonstrate feasibility, we built an experimental testbed with a SiP switch-enabled reconfigurable fat tree topology and evaluated the network performance of distributed ring all-reduce and parameter server workloads. The experimental results show up to 3.6× improvement over the static, non-reconfigurable fat tree. Our large-scale simulations show that server regrouping can deliver up to 2.3× flow throughput improvement for a 2× tapered fat tree and a further 11% improvement when higher-layer bandwidth steering is employed. Collectively, these results show the potential of integrating SiP switches into datacenters and HPC systems to accelerate distributed deep learning training.
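For readers unfamiliar with the collective pattern evaluated above: ring all-reduce sums each worker's gradients via a reduce-scatter phase followed by an all-gather phase around a logical ring. The short NumPy sketch below is our own single-process illustration of that pattern (function and variable names are hypothetical; this is not the authors' testbed code):

  import numpy as np

  def ring_all_reduce(grads):
      """Sum-reduce equal-length gradient arrays across n 'workers'."""
      n = len(grads)
      # Each worker splits its gradient into n chunks.
      chunks = [list(np.array_split(g.astype(float), n)) for g in grads]

      # Reduce-scatter: in step s, worker i sends chunk (i - s) % n to its
      # right-hand neighbour, which adds it to its own copy. Snapshot the
      # sends first so this sequential loop models a simultaneous exchange.
      for s in range(n - 1):
          sends = [chunks[i][(i - s) % n].copy() for i in range(n)]
          for i in range(n):
              chunks[(i + 1) % n][(i - s) % n] += sends[i]

      # After reduce-scatter, worker i owns the fully reduced chunk
      # (i + 1) % n. All-gather circulates the finished chunks so every
      # worker ends up with the complete reduced gradient.
      for s in range(n - 1):
          sends = [chunks[i][(i + 1 - s) % n].copy() for i in range(n)]
          for i in range(n):
              chunks[(i + 1) % n][(i + 1 - s) % n] = sends[i]

      return [np.concatenate(c) for c in chunks]

  # Example: 4 workers, each holding a toy 8-element "gradient".
  workers = [np.arange(8, dtype=float) * (w + 1) for w in range(4)]
  reduced = ring_all_reduce(workers)
  assert all(np.allclose(r, sum(workers)) for r in reduced)

Because every worker communicates only with its two ring neighbours, the pattern maps naturally onto the point-to-point optical circuits that SiP switch-enabled server regrouping can provision between training nodes.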

Highlights


  • Server regrouping and server regrouping with bandwidth steering above the ToR reduce execution time by 67% and 47% (3.0× and 1.9× improvements), respectively

  • We have shown a reconfigurable datacenter/high-performance computing (HPC) system architecture using silicon photonic (SiP) switches to accelerate distributed deep learning training workloads



INTRODUCTION

Deep learning (DL) is a branch of machine learning that has become a major driving force behind the progress in artificial intelligence applications such as image classification,[1] natural language processing,[2] and recommendation systems.[3] The demand for better DL models has resulted in the rise of more complex models trained on larger datasets to improve these deep neural networks.[4,5] The typical approach to speeding up the training of these larger DL models is parallelization across many GPU-equipped nodes,[6,7,8] which requires a high-bandwidth interconnect to support the communication between training devices.[9] DL workloads account for a large proportion of the computation in today’s high-performance computing (HPC) operations, and the demand is growing dramatically in datacenters.[10] These trends have shifted the performance bottleneck from compute to the network interconnect due to system fragmentation (applications often receive an allocation on a set of distant, non-contiguous nodes). This places a tremendous challenge on interconnect designs to provide high-bandwidth, low-latency networking to sustain the continual growth of these hardware-driven deep learning applications. Our simulation results show that server regrouping can deliver up to 2.3× flow throughput improvement for a 2× tapered fat tree and a further 11% improvement when higher-layer bandwidth steering is applied.
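To make the bandwidth pressure concrete: in ring all-reduce, each of N workers transmits 2(N − 1)/N times the model size per iteration (reduce-scatter plus all-gather), approaching 2× the model size as N grows. The sketch below works through this arithmetic; the model size and worker counts are illustrative assumptions of our own, not figures from the paper:

  # Back-of-the-envelope ring all-reduce traffic (illustrative numbers only).
  def bytes_sent_per_worker(model_bytes: float, n_workers: int) -> float:
      # Reduce-scatter + all-gather: each worker transmits
      # 2 * (N - 1) / N of the model per training iteration.
      return 2.0 * (n_workers - 1) / n_workers * model_bytes

  model_bytes = 1.5e9  # a hypothetical ~1.5 GB set of gradients
  for n in (8, 64, 512):
      gb = bytes_sent_per_worker(model_bytes, n) / 1e9
      print(f"{n:4d} workers: {gb:.2f} GB sent per worker per iteration")

This multi-gigabyte volume recurs every iteration, so the sustained link bandwidth available between the training nodes directly bounds training throughput, which is precisely the traffic that server regrouping and bandwidth steering aim to place on high-bandwidth paths.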

SILICON PHOTONICS FOR OPTICAL CIRCUIT SWITCHING
System Architecture
SiP Switches and Control
EXPERIMENTS AND RESULTS
SYSTEM-SCALE EVALUATION
Simulation Setup
Server Regrouping and Bandwidth Steering
Results
CONCLUSIONS
