Automating Job Monitoring System for an Ecosystem of High Performance Computing

Kajornsak Piyoungkorn, Chalee Vorakulpipat, Natsuda Kasisopha, Phithak Thaenkaew

doi:10.1145/3167020.3167062

Abstract

Many countries have founded national high performance computing centers to provide computational resources to their scientists upon request. These resources are often used inefficiently because job requests do not reflect actual usage, which leads to unnecessary resource consumption. In this paper, we present a method to monitor and manage High Performance Computing (HPC) resources more efficiently. HPC resources are usually managed by a Portable Batch System (PBS) acting as the Job Management System (JMS) for job scheduling and resource allocation. However, HPC resources are often tied up by inefficient job requests. For instance, a job may request four processors per node for two hours while its actual usage occupies four processors per node for only one hour. The HPC resources therefore lose an hour of productivity, and the queues for job execution grow longer. The automated job monitoring system proposed in this paper scans all jobs on every HPC node and compares each job's request conditions with preset criteria. If the conditions meet the criteria, the inefficient jobs are forcibly cancelled from the HPC queue. The results show that more HPC resources become available for executing other jobs in the queue, which saves resources in the HPC environment, stabilizes the HPC hardware, and promotes an HPC infrastructure ecosystem.
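
The scan-and-compare step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the preset criteria, so everything here is an assumption made for illustration: the `Job` record, the `utilization` measure, and the `min_utilization` and `grace_hours` thresholds are hypothetical and are not taken from the paper or from the PBS interface.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical snapshot of one running job (not a PBS data structure)."""
    job_id: str
    requested_cpus: int    # processors reserved for the job
    elapsed_hours: float   # wall-clock time since the job started
    used_cpu_hours: float  # processor-hours the job has actually consumed


def utilization(job: Job) -> float:
    """Fraction of the processor-hours reserved so far that were actually used."""
    reserved = job.requested_cpus * job.elapsed_hours
    if reserved <= 0:
        return 1.0  # nothing reserved yet, so nothing can have been wasted
    return job.used_cpu_hours / reserved


def select_jobs_to_cancel(jobs, min_utilization=0.5, grace_hours=0.5):
    """Return IDs of jobs that meet the (assumed) cancellation criteria.

    A job is flagged when it has run for at least `grace_hours` and has
    used less than `min_utilization` of what it reserved. Both thresholds
    are illustrative assumptions.
    """
    return [
        job.job_id
        for job in jobs
        if job.elapsed_hours >= grace_hours
        and utilization(job) < min_utilization
    ]


jobs = [
    Job("101", requested_cpus=4, elapsed_hours=1.0, used_cpu_hours=1.0),
    Job("102", requested_cpus=4, elapsed_hours=1.0, used_cpu_hours=3.8),
    Job("103", requested_cpus=8, elapsed_hours=0.1, used_cpu_hours=0.0),
]

to_cancel = select_jobs_to_cancel(jobs)
print(to_cancel)  # ['101']
```

Job 101 is flagged because it used only a quarter of its reserved processor-hours, job 102 is kept, and job 103 is skipped because it is still inside the assumed grace period. In a PBS deployment the returned IDs could then be handed to the scheduler's job deletion command (`qdel`); how the paper's system gathers usage data and issues the cancellation is not described in the abstract.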
