Study of interconnect errors, network congestion, and applications characteristics for throttle prediction on a large scale HPC system

Mohit Kumar, Saurabh Gupta, Tirthak Patel, Michael Wilder, Weisong Shi, Song Fu, Christian Engelmann, Devesh Tiwari

doi:10.1016/j.jpdc.2021.03.001

Abstract

Today’s High Performance Computing (HPC) systems contain thousands of nodes that work together to deliver performance on the order of petaflops. The performance of these systems depends on components such as processors, memory, and the interconnect. Among these, the interconnect plays a major role because it glues together all the hardware components in an HPC system. A slow interconnect can severely impact a scientific application running as multiple processes, since these processes rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks a study that explores different interconnect errors, congestion events, and application characteristics on a large-scale HPC system. In our previous work, we processed and analyzed interconnect data of the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. In this work, we first show how congestion events can impact application performance. We then investigate how application characteristics interact with interconnect errors and network congestion in order to predict applications that encounter congestion with more than 90% accuracy.

Journal: Journal of Parallel and Distributed Computing
Publication Date: Mar 22, 2021
Citations: 2
License type: publisher-specific-oa

Similar Papers

Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System
Mohit Kumar ... Tirthak Patel
01 Jun 2018

Node level Power Profiling and Thermal Management in HPC system
Sherin M A ... Prasanth P
01 Feb 2016

Design of robust scheduling methodologies for high performance computing
01 Jan 2019

Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale (V.2.0)
Christian Engelmann ... Rizwan Ashraf
16 Dec 2022
