E2EWatch: An End-to-End Anomaly Diagnosis Framework for Production HPC Systems

Burak Aksar, Omar Aaziz, Ayse K Coskun, Jim Brandt, Vitus J Leung, Benjamin Schwaller, Manuel Egele

doi:10.1007/978-3-030-85665-6_5

Abstract

In today’s High-Performance Computing (HPC) systems, application performance variations are among the most vital challenges as they adversely affect system efficiency, application performance, and cost. System administrators need to identify the anomalies that are responsible for performance variation and take mitigating actions. One can perform manual root-cause analysis on telemetry data collected by HPC monitoring infrastructures to analyze performance variations. However, manual analysis methods are time-intensive and limited in impact due to the increasing complexity of HPC systems and terabyte/day-sized telemetry data. State-of-the-art approaches use machine learning-based methods to diagnose performance anomalies automatically. This paper deploys an end-to-end machine learning framework that diagnoses performance anomalies on compute nodes on a 1488-node production HPC system. We demonstrate job and node-level anomaly diagnosis results with the Grafana frontend interface at runtime. Furthermore, we discuss challenges and design decisions for the deployment.

Similar Papers

Proctor: A Semi-Supervised Performance Anomaly Diagnosis Framework for Production HPC Systems
Burak Aksar ... Manuel Egele
01 Jan 2020

Design of robust scheduling methodologies for high performance computing
01 Jan 2019

HPAS
Emre Ates ... Ayse K Coskun
05 Aug 2019

Diagnosing Performance Variations in HPC Applications Using Machine Learning
Ozan Tuncer ... Ata Turk
01 Jan 2017
