Abstract

Optical links are widely used in the networks of high performance computing (HPC) systems, but they fail at a relatively high rate compared to their electrical counterparts. Because of this high failure rate, evaluating the impact of link failures on HPC workloads is of particular interest. We extended the Merlin network module of the Structural Simulation Toolkit (SST) to evaluate the impact of link failures on applications running on HPC systems that use dragonfly network topologies. We focus on dragonfly topologies because they are frequently found in HPC systems, including the NERSC Cori and Edison systems. We demonstrate our changes to SST by providing a sample of performance results and routing statistics for a dragonfly network of 8,192 nodes running three HPC workloads with 1% of optical links failed. For the three motifs under consideration, we show that the impact of link failures depends largely on the workload running on the system.
