Abstract

High performance computing (HPC) systems often use optical links in their interconnection networks, but these links fail at a relatively high rate compared to their electrical counterparts. Because of this high failure rate, evaluating the impact of link failures on HPC workloads is of particular interest. We extended the Merlin network module of the Structural Simulation Toolkit (SST) to evaluate the impact of link failures on applications running on HPC systems that use dragonfly network topologies. We focus on dragonfly topologies because they are frequently found in HPC systems, including the NERSC Cori and Edison systems. We demonstrate our changes to SST by providing a sample of performance results and routing statistics for a dragonfly network of 8,192 nodes and three HPC workloads with 1% optical link failures. For the three motifs under consideration, we show that the impact of link failures depends largely on the workload running on the system.
