Abstract

Automated performance modeling and prediction of parallel programs are highly valuable in many use cases, such as guiding task management and job scheduling, offering insights into application behavior, and assisting resource requirement estimation. The performance of parallel programs is affected by numerous factors, including but not limited to hardware, applications, algorithms, and input parameters, which makes accurate performance prediction a challenging task. In this article, we focus on automatically predicting the execution time of parallel programs (more specifically, MPI programs) with different inputs, at different scales, and without domain knowledge. We model the correlation between the execution time and domain-independent runtime features, including variable values and counters of branches, loops, and MPI communications. By automatically instrumenting an MPI program, each execution of the program outputs a feature vector together with its corresponding execution time. After collecting data from executions with different inputs, a random forest machine learning approach is used to build an empirical performance model, which can predict the execution time of the program given a new input. A transfer learning method is used to reuse an existing performance model and improve the prediction accuracy on a new platform that lacks historical execution data. Our experiments and analyses of three parallel applications, Graph500, GalaxSee, and SMG2000, on three different systems confirm that our method performs well, with less than 20 percent prediction error on average.
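To illustrate the modeling step described above, the following is a minimal sketch (not the authors' implementation) of fitting a random forest regressor to runtime feature vectors and measured execution times, then predicting the time for unseen inputs. The file names, feature layout, and hyperparameters are hypothetical.

```python
# Sketch: empirical performance model with a random forest.
# Assumes each row of features.csv is one instrumented run:
# [variable values, branch/loop counters, MPI call counters, ...]
# and exec_times.csv holds the corresponding execution times in seconds.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error

X = np.loadtxt("features.csv", delimiter=",")    # hypothetical feature file
y = np.loadtxt("exec_times.csv", delimiter=",")  # measured execution times (s)

# Hold out some runs to estimate prediction error on unseen inputs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("mean relative prediction error:",
      mean_absolute_percentage_error(y_test, pred))
```

In practice, one model of this kind would be trained per application and platform; the transfer learning step mentioned in the abstract would then adapt such a model to a new platform with little or no local training data.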
