Improvements to Supercomputing Service Availability Based on Data Analysis

Jae-Kook Lee, Junweon Yoon, Joon Woo, Guohua Li, Do-Sik An, Sung-Jun Kim, Min-Woo Kwon, Taeyoung Hong

doi:10.3390/app11136166


Open Access

https://doi.org/10.3390/app11136166


Abstract

As demand for high-performance computing (HPC) resources grows in computational science, service availability becomes an unavoidable concern in large cluster systems such as supercomputers. The factor that most affects the availability of supercomputing services is the job scheduler used to allocate resources. Our analysis of user job data submitted through the job scheduler showed that 25.6% of jobs failed because of program errors, scheduler errors, or I/O errors. Based on this analysis, we propose a K-hook method for scheduling that increases the success rate of job submissions and improves the availability of supercomputing services. Applying this method improved the job-submission success rate by 15% without negatively affecting users’ waiting time. We also achieved a mean time between interrupts (MTBI) of 24.3 days and maintained average system availability at 97%. Because the method was verified on the Nurion supercomputer in a real service environment, we expect it to translate into significant service improvements.
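To make the two stability figures in the abstract concrete, the sketch below computes MTBI and availability from a list of outage windows. It assumes the common conventions that MTBI is uptime divided by the number of interrupts and that availability is uptime divided by the total period; the abstract does not spell out the exact formulas used, so treat these definitions, the function name, and the sample dates as illustrative assumptions only.

```python
from datetime import datetime, timedelta


def mtbi_and_availability(period_start, period_end, interrupts):
    """Return (MTBI, availability) for one observation period.

    interrupts: list of (start, end) datetimes for system-wide outages.
    Assumed definitions (not taken from the paper):
      availability = uptime / total period
      MTBI         = uptime / number of interrupts
    """
    total = period_end - period_start
    downtime = sum((end - start for start, end in interrupts), timedelta())
    uptime = total - downtime
    availability = uptime / total  # timedelta / timedelta gives a float
    mtbi = uptime / len(interrupts) if interrupts else None
    return mtbi, availability


# Hypothetical data: a 30-day period with two 12-hour outages.
start = datetime(2021, 1, 1)
end = start + timedelta(days=30)
outages = [
    (datetime(2021, 1, 5, 0), datetime(2021, 1, 5, 12)),
    (datetime(2021, 1, 20, 6), datetime(2021, 1, 20, 18)),
]
mtbi, availability = mtbi_and_availability(start, end, outages)
print(mtbi, round(availability, 4))
```

With no interrupts in the period the function returns `None` for MTBI rather than dividing by zero.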

Highlights

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations
Many organizations that operate a supercomputer analyze the job scheduling log data of their users to find the causes of problems and remedy them, thereby improving service availability [1,2,3,4,5].
We describe our operational technique and configuration, which includes the hardware and software structures of the Nurion system; present our system’s mean time between interrupts (MTBI), which is an indicator of system stability [6,7,8]; and analyze our [...] 3 provides main problem statements.

Summary

Introduction

Supercomputers are used to perform computationally intensive simulations and analyses in fields such as climate research, molecular modeling, physical simulation, cryptography, geophysical modeling, automotive and aerospace design, financial modeling, and data mining. Ensuring the availability of large cluster systems, such as supercomputers, is challenging. Many organizations that operate a supercomputer analyze the job scheduling log data of their users to find the causes of problems and remedy them, thereby improving service availability [1,2,3,4,5].
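As a minimal sketch of the kind of job scheduling log analysis described here, the snippet below tallies failed jobs by cause and reports the overall failure rate. The record layout (a `status` field with values such as `ok`, `program_error`, `scheduler_error`, `io_error`) is a hypothetical simplification and not the format of any real scheduler's accounting log.

```python
from collections import Counter


def failure_breakdown(jobs):
    """Return (failure_rate, failures_by_cause) for a list of job records.

    Each record is a dict with a hypothetical "status" key; any status
    other than "ok" is counted as a failure.
    """
    counts = Counter(job["status"] for job in jobs)
    total = sum(counts.values())
    failed = total - counts.get("ok", 0)
    rate = failed / total if total else 0.0
    by_cause = {cause: n for cause, n in counts.items() if cause != "ok"}
    return rate, by_cause


# Hypothetical sample records, invented for illustration.
sample_jobs = [
    {"id": 1, "status": "ok"},
    {"id": 2, "status": "program_error"},
    {"id": 3, "status": "ok"},
    {"id": 4, "status": "io_error"},
]
rate, by_cause = failure_breakdown(sample_jobs)
print(rate, by_cause)
```

Grouping failures by cause in this way shows which category dominates and therefore where a remedy would help most.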



Journal: Applied Sciences
Publication Date: Jul 2, 2021
License type: CC BY 4.0
