Abstract

Neural networks are a major driving force behind a new golden age of computing; computer architects have proposed specialized accelerators (e.g., the TPU), high-speed interconnects (e.g., NVLink), and algorithms (e.g., ring-based reduction) to support neural network applications efficiently. As a result, these platforms achieve orders-of-magnitude higher efficiency for neural network computation and inter-accelerator communication than traditional computing platforms.

In this paper, we identify that these emerging platforms have shifted the performance bottleneck of neural network processing from model computation and inter-accelerator communication to data preparation. Although overlapping data preparation with computation and communication has traditionally hidden the preparation overhead, the higher input-processing demands of emerging platforms are reversing the situation: at scale, data preparation requires an infeasible amount of host-side CPU, memory, and PCIe resources. Our detailed analysis reveals that this heavy resource consumption comes from transforming data into neural-network-specific formats and from buffering for communication among devices.

Therefore, we propose a scalable neural network server architecture that balances data preparation against computation and communication. To achieve extreme scalability, our design relies on a scalable device array rather than on limited host resources, following three key ideas. First, we offload CPU-intensive operations to customized data preparation accelerators so that training performance scales regardless of host-side CPU performance. Second, we apply direct inter-device communication to eliminate unnecessary data copies and reduce pressure on host memory. Lastly, we cluster the underlying devices according to the unique communication patterns of neural network processing and the characteristics of the interconnect, so as to efficiently utilize the aggregate interconnect bandwidth. Our evaluation shows that the proposed architecture achieves 44.4× higher training throughput on average than a naively extended server architecture with 256 neural network accelerators.
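For readers unfamiliar with the ring-based reduction mentioned above, the following is a minimal single-process Python sketch of ring all-reduce. It illustrates the general technique only, not this paper's implementation: the "devices" are plain Python lists standing in for accelerators, and "sends" are list writes.

    def ring_allreduce(buffers):
        """Sum-reduce equal-length vectors across n simulated devices."""
        n = len(buffers)
        length = len(buffers[0])
        assert length % n == 0, "vector must split into n equal chunks"
        c = length // n
        out = [list(b) for b in buffers]  # per-device working copies

        def sl(idx):
            # Positions covered by chunk idx (taken modulo n).
            i = idx % n
            return slice(i * c, (i + 1) * c)

        # Phase 1 (scatter-reduce): at step s, device d sends chunk (d - s)
        # to device d + 1, which accumulates it. After n - 1 steps, device d
        # holds the complete sum for chunk (d + 1) mod n.
        for s in range(n - 1):
            for d in range(n):
                src, dst = sl(d - s), (d + 1) % n
                for i in range(src.start, src.stop):
                    out[dst][i] += out[d][i]

        # Phase 2 (all-gather): at step s, device d forwards its completed
        # chunk (d + 1 - s) to device d + 1, which overwrites its copy.
        for s in range(n - 1):
            for d in range(n):
                src, dst = sl(d + 1 - s), (d + 1) % n
                for i in range(src.start, src.stop):
                    out[dst][i] = out[d][i]
        return out

    # Every device ends up with the elementwise sum of all inputs, while each
    # device sends only 2 * (n - 1) / n of the data volume, independent of n.
    print(ring_allreduce([[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]))
    # -> [[11.0, 22.0, 33.0, 44.0], [11.0, 22.0, 33.0, 44.0]]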
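Likewise, the preparation/computation overlap that the abstract argues no longer suffices can be sketched as a simple producer-consumer pipeline. In this sketch, decode_and_transform and train_step are hypothetical placeholders for CPU-side preparation and device-side computation, not names from the paper.

    import queue
    import threading

    def prefetch(batches, prepare, depth=2):
        """Yield prepared batches, running `prepare` up to `depth` batches ahead."""
        q = queue.Queue(maxsize=depth)
        done = object()  # sentinel marking end of input

        def producer():
            for b in batches:
                q.put(prepare(b))  # CPU-side decode/transform work
            q.put(done)

        threading.Thread(target=producer, daemon=True).start()
        while (item := q.get()) is not done:
            yield item

    # Hypothetical usage: preparation of batch i+1 overlaps training on batch i.
    # Once prepare() outgrows the host CPU, this overlap can no longer hide the
    # preparation cost, which is the bottleneck shift the paper identifies.
    # for batch in prefetch(raw_batches, decode_and_transform):
    #     train_step(batch)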
