Abstract

As the demand for high-performance computing (HPC) resources has increased in the field of computational science, an inevitable consideration is service availability in large cluster systems such as supercomputers. In particular, the factor that most affects availability in supercomputing services is the job scheduler utilized for allocating resources. Consequent to submitting user data through the job scheduler for data analysis, 25.6% of jobs failed because of program errors, scheduler errors, or I/O errors. Based on this analysis, we propose a K-hook method for scheduling to increase the success rate of job submissions and improve the availability of supercomputing services. By applying this method, the job-submission success rate was improved by 15% without negatively affecting users’ waiting time. We also achieved a mean time between interrupts (MTBI) of 24.3 days and maintained average system availability at 97%. As this research was verified on the Nurion supercomputer in a real service environment, the value of the research is expected to be found in significant service improvements.

Highlights

  • Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations

  • Many organizations operate a supercomputer to analyze the job scheduling log data of the supercomputing users. They can find the causes of problems and remedy them to improve service availability [1,2,3,4,5]

  • We describe our operational technique; uration, which includes the hardware and software structures of the Nurion system present our system’s mean time between interrupts (MTBI), which is an indicator of system stability [6,7,8]; and analyze our 3 provides main problem statements

Read more

Summary

Introduction

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. Supercomputers are used to perform computationally intensive simulations and analyses in fields such as climate research, molecular modeling, physical simulation, cryptography, geophysical modeling, automotive and aerospace design, financial modeling, and data mining. Ensuring the availability of large cluster systems, such as supercomputers, is challenging. Many organizations operate a supercomputer to analyze the job scheduling log data of the supercomputing users. They can find the causes of problems and remedy them to improve service availability [1,2,3,4,5]

Methods
Results
Conclusion

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.