We present and evaluate the ExaNeSt Prototype, which compactly packages 128 Xilinx ZU9EG MPSoCs, 2 TBytes of DRAM, and 8 TBytes of SSD into a liquid-cooled rack, using custom interconnect hardware based on 10 Gbps links. We developed this testbed in 2016-2019 in order to leverage the flexibility of FPGAs for experimenting with efficient hardware support for HPC communication among tens of thousands of processors and accelerators, in the quest towards Exascale systems and beyond. In the years since, we have studied this system carefully, and we present our key design choices along with insights resulting from our measurements and analysis. The testbed, from its architecture to the PCBs and the runtime software, was developed within the ExaNeSt project. It is fully operational in configurations with up to \(8\times 4\times 4\) MPSoC nodes. It achieves high density through tight board design, while also leveraging state-of-the-art liquid-cooling technology. In this paper, we present a thorough architectural analysis, along with important aspects of our infrastructure development. Our custom interconnect includes a low-cost, low-latency network interface offering user-level, zero-copy RDMA, which we coupled with the ARMv8 processors in the MPSoCs. We further developed the corresponding runtimes that allow us to test real MPI applications on the large-scale testbed. We evaluated our platform through MPI microbenchmarks, mini-applications, and full MPI applications. Single-hop, one-way latency is \(1.3~\mu\mathrm{s}\); approximately \(0.47~\mu\mathrm{s}\) of this is attributed to the network interface and to the user-space library that exposes its functionality to the runtime. Latency over longer paths increases as expected, reaching \(2.55~\mu\mathrm{s}\) for a five-hop path. Bandwidth tests show that, for a single hop, link utilization reaches \(82\%\) of the theoretical capacity. Microbenchmarks based on MPI collectives reveal that broadcast latency scales as expected when the number of participating ranks increases. We also implemented a custom MPI_Allreduce accelerator in the network interface, which reduces the latency of such collectives by up to \(88\%\). We assessed performance scaling through weak- and strong-scaling tests for HPCG, LAMMPS, and the miniFE mini-application; in all these tests, parallelization efficiency is at least \(69\%\).
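For context, the sketch below shows the general form of an MPI ping-pong microbenchmark of the kind commonly used to estimate one-way point-to-point latency, as reported above. It is illustrative only and is not the benchmark code used on the ExaNeSt testbed; the message size and iteration count are arbitrary assumptions.

```c
/* Minimal MPI ping-pong sketch for estimating one-way latency.
 * Illustrative only: not the ExaNeSt benchmark code; message size
 * and iteration count are assumptions for this example. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 10000;   /* assumed iteration count */
    char buf[8] = {0};         /* small message, so timing is latency-bound */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        /* Half of the average round-trip time estimates one-way latency. */
        printf("one-way latency: %.3f us\n",
               (t1 - t0) / (2.0 * iters) * 1e6);
    }

    MPI_Finalize();
    return 0;
}
```

Run with two ranks placed on neighboring nodes (e.g., `mpirun -np 2 ./pingpong`) so that the measured path corresponds to a single network hop.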