Abstract

There is a growing trend of training deep learning networks on distributed clusters. Unfortunately, distributed deep learning (DDL) is prone to Byzantine failures, in which some nodes corrupt training by sending malicious gradients to the parameter server (PS). Existing works address this problem by implementing Byzantine defenses on the PS. However, these defenses come with a large computational overhead that seriously degrades DDL training performance. Moreover, malicious gradients are not identified until they reach the endpoint (the PS), which wastes network resources and reduces communication efficiency. In this paper, we propose NetShield, a high-performance and universal in-network architecture against Byzantine failures in DDL. By exploiting the computing power of programmable switches, NetShield accelerates various Byzantine defenses and improves communication efficiency. Based on NetShield, we improve two popular Byzantine-robust methods (the median-based method and the trimmed-mean-based method) and construct two switch-friendly defenders. We implement NetShield on a real testbed that includes a programmable switch. Compared with traditional Byzantine defense deployments, NetShield reduces the total processing time by a factor of more than 2, decreases the computing time by a factor of 2 to 5, and cuts the traffic volume in half. In addition, NetShield demonstrates good universality by flexibly deploying the two designed defenders.
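For context, the following is a minimal NumPy sketch of the two classical aggregation rules the abstract refers to (coordinate-wise median and trimmed mean), not NetShield's switch-side implementation; the function names and the trim ratio are illustrative assumptions.

```python
import numpy as np

def coordinate_wise_median(gradients):
    """Aggregate worker gradients by taking the median of each coordinate.
    gradients: array of shape (num_workers, num_params)."""
    return np.median(gradients, axis=0)

def coordinate_wise_trimmed_mean(gradients, trim_ratio=0.1):
    """Aggregate by discarding the largest and smallest `trim_ratio` fraction
    of values in each coordinate, then averaging the remaining values."""
    num_workers = gradients.shape[0]
    k = int(trim_ratio * num_workers)          # number of values trimmed from each end
    sorted_grads = np.sort(gradients, axis=0)  # sort each coordinate across workers
    if k > 0:
        sorted_grads = sorted_grads[k:num_workers - k]
    return sorted_grads.mean(axis=0)

# Example: 6 workers, one of which sends an outlier (malicious) gradient.
honest = np.random.normal(0.0, 0.1, size=(5, 4))
malicious = np.full((1, 4), 100.0)
grads = np.vstack([honest, malicious])
print(coordinate_wise_median(grads))              # unaffected by the outlier
print(coordinate_wise_trimmed_mean(grads, 1 / 6)) # outlier trimmed before averaging
```

Both rules are robust to a bounded number of malicious workers because an outlier gradient either falls away from the per-coordinate median or is removed by the trimming step before averaging.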
