Abstract

There has been much effort to offload MPI collective operations into hardware. But while NIC-based collective acceleration is well studied, offloading collective processing into the switching fabric, despite its numerous advantages, has received far less attention. A major problem with fixed-logic implementations is that either only a fraction of the possible collective communication is accelerated, or logic is wasted on applications that do not need a particular capability. Using reconfigurable logic has numerous advantages: exactly the required operations can be implemented; the desired level of performance can be specified; and new, possibly complex, operations can be defined and implemented. We have designed an in-switch collective accelerator, MPI-FPGA, and demonstrated its use with seven MPI collectives across a set of benchmarks and proxy applications (MiniApps). The accelerator uses a novel two-level switch design containing fully pipelined, vectorized aggregation logic units. Essential to this work is support for sub-communicator collectives, which enables communicators of arbitrary shape and scales to large systems. A streaming interface improves performance for long messages. While this reconfigurable design is generally applicable, we prototype it with an FPGA-centric cluster. A sample MPI-FPGA design in a direct network achieves considerable speedups over conventional clusters in the most likely scenarios. We also present results for indirect networks with reconfigurable high-radix switches and show that this approach is competitive with SHArP technology for the subset of operations that SHArP supports. MPI-FPGA is fully integrated into MPICH and is transparent to MPI applications.
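As the abstract notes, MPI-FPGA is integrated into MPICH, supports collectives on sub-communicators of arbitrary shape, and is transparent to MPI applications. The following minimal sketch uses only standard MPI and is not taken from the paper; it simply illustrates the kind of sub-communicator collective such a design would accelerate with no changes to application source code.

/* Illustrative sketch only: an ordinary MPI program performing a collective
 * on a sub-communicator. Nothing here is specific to MPI-FPGA; the point is
 * that in-switch acceleration, as described, would apply transparently. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Split the world communicator into two sub-communicators of arbitrary shape. */
    MPI_Comm sub_comm;
    int color = world_rank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);

    /* A standard allreduce on the sub-communicator; the application code is
     * identical whether or not the reduction is offloaded into the switch. */
    double local = (double)world_rank, sum = 0.0;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, sub_comm);

    if (world_rank == 0)
        printf("sub-communicator sum at rank 0: %f\n", sum);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}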
