We present and evaluate the ExaNeSt Prototype, which compactly packages 128 Xilinx ZU9EG MPSoCs, 2 TBytes of DRAM, and 8 TBytes of SSD into a liquid-cooled rack, using custom interconnect hardware based on 10 Gbps links. We developed this testbed in 2016-2019 in order to leverage the flexibility of FPGAs for experimenting with efficient hardware support for HPC communication among tens of thousands of processors and accelerators, in the quest towards Exascale systems and beyond. In the years since, we have studied this system carefully, and we present our key design choices along with insights resulting from our measurements and analysis. The testbed, from its architecture to the PCBs and the runtime software, was developed within the ExaNeSt project. It is fully operational in configurations with up to \(8\times 4\times 4\) MPSoC nodes. It achieves high density through tight board design, while also leveraging state-of-the-art liquid-cooling technology. In this paper, we present a thorough architectural analysis, along with important aspects of our infrastructure development. Our custom interconnect includes a low-cost, low-latency network interface offering user-level, zero-copy RDMA, which we coupled with the ARMv8 processors in the MPSoCs. We further developed the corresponding runtimes that allow us to test real MPI applications on the large-scale testbed. We evaluated our platform through MPI microbenchmarks, mini-applications, and full MPI applications. Single-hop, one-way latency is \(1.3~\mu\mathrm{s}\); approximately \(0.47~\mu\mathrm{s}\) of this is attributed to the network interface and to the user-space library that exposes its functionality to the runtime. Latency over longer paths increases as expected, reaching \(2.55~\mu\mathrm{s}\) for a five-hop path. Bandwidth tests show that, for a single hop, link utilization reaches \(82\%\) of the theoretical capacity. Microbenchmarks based on MPI collectives reveal that broadcast latency scales as expected when the number of participating ranks increases. We also implemented a custom MPI_Allreduce accelerator in the network interface, which reduces the latency of such collectives by up to \(88\%\). We assessed performance scaling through weak- and strong-scaling tests for HPCG, LAMMPS, and the miniFE mini-application; in all these tests, parallelization efficiency is at least \(69\%\).
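For context, the sketch below shows the general form of an MPI ping-pong microbenchmark of the kind commonly used to estimate one-way point-to-point latency, as reported above. It is illustrative only and is not the benchmark code used on the ExaNeSt testbed; the message size and iteration count are arbitrary assumptions.

```c
/* Minimal MPI ping-pong sketch for estimating one-way latency.
 * Illustrative only: not the ExaNeSt benchmark code; message size
 * and iteration count are assumptions for this example. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 10000;   /* assumed iteration count */
    char buf[8] = {0};         /* small message, so timing is latency-bound */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        /* Half of the average round-trip time estimates one-way latency. */
        printf("one-way latency: %.3f us\n",
               (t1 - t0) / (2.0 * iters) * 1e6);
    }

    MPI_Finalize();
    return 0;
}
```

Run with two ranks placed on neighboring nodes (e.g., `mpirun -np 2 ./pingpong`) so that the measured path corresponds to a single network hop.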