Flexible silicon photonic architecture for accelerating distributed deep learning

Zhenguo Wu, Songli Wang, Liang Yuan Dai, Yuyang Wang, Keren Bergman

doi:10.1364/jocn.497372

Abstract

The increasing size and complexity of deep learning (DL) models have led to the wide adoption of distributed training methods in datacenters (DCs) and high-performance computing (HPC) systems. However, communication among distributed computing units (CUs) has emerged as a major bottleneck in the training process. In this study, we propose Flex-SiPAC, a flexible silicon photonic accelerated compute cluster designed to accelerate multi-tenant distributed DL training workloads. Flex-SiPAC takes a co-design approach that combines a silicon photonic hardware platform with a tailored collective algorithm, optimized to leverage the unique physical properties of the architecture. The hardware platform integrates a novel wavelength-reconfigurable transceiver design and a micro-resonator-based wavelength-reconfigurable switch, enabling the system to achieve flexible bandwidth steering in the wavelength domain. The collective algorithm is designed to support reconfigurable topologies, enabling efficient all-reduce communications that are commonly used in DL training. The feasibility of the Flex-SiPAC architecture is demonstrated through two testbed experiments. First, an optical testbed experiment demonstrates the flexible routing of wavelengths by shuffling an array of input wavelengths using a custom-designed spatial-wavelength selective switch. Second, a four-GPU testbed running two DL workloads shows a 23% improvement in job completion time compared to a similarly sized leaf-spine topology. We further evaluate Flex-SiPAC using large-scale simulations, which show that Flex-SiPAC is able to reduce the communication time by 26% to 29% compared to state-of-the-art compute clusters under representative collective operations.
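
For readers unfamiliar with the all-reduce collective mentioned in the abstract, the following is a minimal sketch of the textbook ring all-reduce (a reduce-scatter phase followed by an all-gather phase), simulated in plain Python. It is not the tailored collective algorithm of Flex-SiPAC, which the abstract does not specify. The function name `ring_all_reduce`, the in-memory lists standing in for workers, and the requirement that the vector length be divisible by the worker count are all assumptions made for illustration.

```python
from typing import List


def ring_all_reduce(data: List[List[float]]) -> List[List[float]]:
    """Simulate a sum all-reduce over n workers arranged in a logical ring.

    data[i] is worker i's local vector (e.g. its gradients). Illustrative
    assumption: all vectors share one length that is divisible by n.
    Returns one vector per worker; every returned vector is the element-wise sum.
    """
    n = len(data)
    length = len(data[0])
    if any(len(v) != length for v in data) or length % n != 0:
        raise ValueError("vectors must share a length divisible by the worker count")

    chunk = length // n
    bufs = [list(v) for v in data]  # copy so the caller's input is not modified

    def span(c: int) -> range:
        """Index range covered by chunk c."""
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in step s, worker i sends chunk (i - s) mod n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        # Snapshot payloads first so all transfers in a step are simultaneous.
        sends = [(i, (i - s) % n) for i in range(n)]
        payloads = [[bufs[i][k] for k in span(c)] for i, c in sends]
        for (i, c), payload in zip(sends, payloads):
            dst = (i + 1) % n
            for k, val in zip(span(c), payload):
                bufs[dst][k] += val
    # Worker i now holds the fully reduced chunk (i + 1) mod n.

    # Phase 2, all-gather: in step s, worker i forwards chunk (i + 1 - s) mod n,
    # and the right neighbour overwrites its copy with the reduced values.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n) for i in range(n)]
        payloads = [[bufs[i][k] for k in span(c)] for i, c in sends]
        for (i, c), payload in zip(sends, payloads):
            dst = (i + 1) % n
            for k, val in zip(span(c), payload):
                bufs[dst][k] = val

    return bufs


if __name__ == "__main__":
    workers = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
    for rank, vec in enumerate(ring_all_reduce(workers)):
        print(rank, vec)
```

In this scheme each worker sends 2(n - 1) chunks of size 1/n of the vector, so the per-worker traffic stays close to twice the vector size regardless of n. That is why link bandwidth between neighbouring workers, the quantity a reconfigurable topology can steer, tends to dominate the communication time.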

Journal: Journal of Optical Communications and Networking

Similar Papers

Peta-Scale Embedded Photonics Architecture for Distributed Deep Learning Applications
Zhenguo Wu ... Liang Yuan Dai
Journal of Lightwave Technology | VOL. 41
15 Jun 2023

MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems
Jingoo Han ... Ali R Butt
01 May 2020

DWPE, a new data center energy-efficiency metric bridging the gap between infrastructure and workload
Torsten Wilde ... Axel Auweter
01 Jul 2014

Convergence-aware optimal checkpointing for exploratory deep learning training jobs
Hongliang Li ... Haixiao Xu
Future Generation Computer Systems | VOL. 164
01 Mar 2025
