Scheduling ML and HPC Jobs with Shoc Platform over Kubernetes


Similar Papers
  • Conference Article
  • Citations: 1
  • 10.18725/oparu-9865
Predictability of Resource Intensive Big Data and HPC Jobs in Cloud Data Centres
  • Aug 13, 2018
  • Christopher B Hauser + 2 more

Cloud data centres share physical resources among multiple users at the same time, which can lead to resource interference. Especially with resource-intensive computations such as HPC or big data processing jobs, neighbouring applications in a cloud data centre may see reduced performance from their assigned virtual resources. This work evaluates the predictability of such resource-intensive jobs in principle. The assumption is that the execution behaviour of such computations depends on the computation parameters and the environment parameters. From these two influencing factors, predictability is achieved by removing the hardware-dependent environment parameters from the observed execution behaviour, in order to compute the execution behaviour of computations with similar computation parameters on a different environment. The assumptions are analysed and evaluated with the HPC application Molpro.

  • Conference Article
  • 10.1109/hipc.2018.00039
Why do Users Kill HPC Jobs?
  • Dec 1, 2018
  • Venkatesh-Prasad Ranganath + 1 more

Given the cost of HPC clusters, making the best use of them is crucial to improve infrastructure ROI. Likewise, reducing failed HPC jobs and the related waste in terms of user wait times is crucial to improve HPC user productivity (aka human ROI). While most efforts (e.g., debugging HPC programs) explore technical aspects to improve the ROI of HPC clusters, we hypothesize that non-technical (human) aspects are worth exploring to make non-trivial ROI gains; specifically, understanding non-technical aspects and how they contribute to the failure of HPC jobs. In this regard, we conducted a case study in the context of the Beocat cluster at Kansas State University. The purpose of the study was to learn the reasons why users terminate jobs and to quantify the wasted computation in such jobs in terms of system utilization and user wait time. The data from the case study helped identify interesting and actionable reasons why users terminate HPC jobs. It also helped confirm that user-terminated jobs may be associated with a non-trivial amount of wasted computation, which, if reduced, can help improve the ROI of HPC clusters.

  • Conference Article
  • Citations: 3
  • 10.1145/3219104.3219121
Automatic Characterization of HPC Job Parallel Filesystem I/O Patterns
  • Jul 22, 2018
  • Joseph P White + 5 more

As part of the NSF-funded XMS project, we are actively researching automatic detection of poorly performing HPC jobs. To aid the analysis, we have generated a taxonomy of the temporal I/O patterns of HPC jobs. In this paper we describe the design of temporal pattern characterization algorithms for HPC job I/O. We have implemented these algorithms in the Open XDMoD job analysis framework. These I/O classifications include periodic patterns and a variety of characteristic non-periodic patterns. We present an analysis of the I/O patterns observed on the /scratch filesystem of an academic HPC cluster. This type of analysis can be extended to other HPC usage data such as memory, CPU, and interconnect usage. Ultimately this analysis will be used to improve HPC throughput and efficiency by, for example, automatically identifying anomalous HPC jobs.

  • Conference Article
  • Citations: 15
  • 10.1109/clustr.2006.311855
JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management
  • Jan 1, 2006
  • K Uhlemann + 2 more

Most of today's HPC systems employ a single head node for control, which represents a single point of failure, as its failure interrupts the entire HPC system. Furthermore, it is also a single point of control, as it disables the entire HPC system until repaired. One of the most important HPC system services running on the head node is job and resource management. If it goes down, all currently running jobs lose the service they report back to and have to be restarted once the head node is up and running again. In this paper, we present a generic approach for providing symmetric active/active replication for highly available HPC job and resource management. The JOSHUA solution provides a virtually synchronous environment for continuous availability without any interruption of service and without any loss of state. Replication is performed externally via the PBS service interface, without the need to modify any service code. Test results as well as an availability analysis of our proof-of-concept prototype implementation show that JOSHUA can provide continuous availability with an acceptable performance trade-off.

  • Conference Article
  • Citations: 26
  • 10.1109/sc.2016.55
A Data Driven Scheduling Approach for Power Management on HPC Systems
  • Nov 1, 2016
  • Sean Wallace + 6 more

Modern schedulers running on HPC systems traditionally consider the number of resources and the time requested for each job when making scheduling decisions. Until recently this has been sufficient; however, as systems get larger, other metrics such as power consumption become necessary to ensure system stability. In this paper, we propose a data-driven scheduling approach for controlling the power consumption of the entire system under any user-defined budget. Here, “data driven” means that our approach actively observes, analyzes, and assesses the power behaviors of the system and user jobs to guide scheduling decisions for power management. This design is based on the key observation that HPC jobs have distinct power profiles. Our work contains an empirical analysis of workload power characteristics on a production system, a dynamic learner to estimate the job power profile for scheduling, and an online power-aware scheduler for managing the overall system power. Using real workload traces, we demonstrate that our design effectively controls system power consumption while minimizing the impact on system utilization.

  • Research Article
  • 10.17608/k6.auckland.11959329.v1
Webinar: Making the most of your NeSI HPC allocation
  • Mar 10, 2020
  • A Jonathan Shaw

If effectively anticipating your job’s HPC resource requirements is a skill that you want to develop, come along to NeSI’s next “Quick Tips” webinar: Make the most of your NeSI HPC allocation. NeSI’s Anthony Shaw will demonstrate how you can easily monitor the efficiency of your HPC jobs, and why doing so could help you reduce your job’s queueing time. This free, 1-hour webinar will cover:
  • which SLURM commands can help you view how efficiently your job ran (e.g. sacct),
  • what an efficient job looks like in terms of CPU utilisation, and
  • tips for optimising job configuration and reducing your queue time.
This webinar is for anyone working on NeSI, but it is especially helpful for those looking to reduce the time their projects spend in the queue. (More efficient jobs will have less effect on your Fair Share score, resulting in shorter queue times.)
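The sacct-based efficiency check described above can be sketched as follows. The job ID and field list are illustrative, and the small helper computes CPU efficiency from sample sacct-style values rather than querying a live Slurm cluster, so it runs anywhere:

```shell
# Illustrative sacct query for a finished job (1234567 is a placeholder ID):
#   sacct -j 1234567 --format=JobID,Elapsed,TotalCPU,AllocCPUS,State
#
# Helper that turns sacct-style [DD-]HH:MM:SS durations into seconds,
# then derives a rough CPU-efficiency percentage (TotalCPU / Elapsed*AllocCPUS).
# Fractional seconds in TotalCPU are ignored in this sketch.

to_seconds() {
  # convert [DD-]HH:MM:SS to seconds
  local t=$1 days=0
  case $t in *-*) days=${t%%-*}; t=${t#*-};; esac
  local h=${t%%:*} rest=${t#*:}
  local m=${rest%%:*} s=${rest#*:}
  echo $(( (10#$days * 24 + 10#$h) * 3600 + 10#$m * 60 + 10#$s ))
}

cpu_efficiency() {
  # usage: cpu_efficiency ELAPSED TOTALCPU ALLOCCPUS  -> integer percent
  local wall cpu
  wall=$(to_seconds "$1")
  cpu=$(to_seconds "$2")
  echo $(( 100 * cpu / (wall * $3) ))
}

# A job that ran 1 hour on 4 CPUs but used only 30 CPU-minutes:
cpu_efficiency 01:00:00 00:30:00 4   # prints 12
```

A low percentage like this suggests the job requested more CPUs than it used, which is exactly the kind of over-request that inflates Fair Share usage and queue times.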

  • Conference Article
  • 10.1145/3332186.3332214
Slicing and Dicing OpenHPC Infrastructure
  • Jul 28, 2019
  • Satrio Husodo + 4 more

University research computing centers are increasingly faced with the need to support applications that are better suited to cloud infrastructure than HPC infrastructure. A common approach is to shoehorn cloud-based applications onto the university's existing HPC system, which has been done with varying levels of success. Another approach has been to create stand-alone HPC systems and private cloud systems, resulting in ineffective use of resources. A more recent approach has been to use hybrid systems where the HPC system bursts excess jobs to private cloud nodes configured as bare-metal nodes built from the same (expensive) hardware as the HPC system. This paper explores another model, namely the use of private cloud infrastructure (built from inexpensive commodity networks and storage systems) to host both HPC jobs and VMs simultaneously. Utilizing VMs allows these emerging applications to leverage cloud frameworks specifically designed for them (e.g., OpenStack, Kubernetes, Mesos, Hadoop, and Spark), while at the same time effectively supporting a growing percentage of HPC jobs (e.g., single-node jobs and embarrassingly parallel jobs). Because the system can be constructed from commodity cloud networks and storage, it makes cost-effective use of the resources, as opposed to HPC systems whose expensive resources are wasted on jobs that do not use them. To demonstrate the advantages of using cloud infrastructure for both cloud applications and HPC applications, we describe a system that can dynamically launch OpenHPC systems on commodity OpenStack infrastructure. Moreover, users can use the system to deploy personal OpenHPC clusters customized to their application's needs (e.g., number of nodes, cores per node, memory per node).
We have used the system to effectively run OpenHPC workloads on a cluster of large-memory OpenStack nodes, allowing users to create, for example, a large-memory HPC-style cluster of 500 GB nodes running OpenHPC and a cluster of 1 TB VMs operating simultaneously. Performance degradation due to virtualization has been insignificant, particularly when compared to the advantages of being able to use optimized frameworks running on cost-effective hardware.

  • Conference Article
  • Citations: 6
  • 10.1109/sc41404.2022.00045
Using Unused: Non-Invasive Dynamic FaaS Infrastructure with HPC-Whisk
  • Nov 1, 2022
  • Bartłomiej Przybylski + 5 more

Modern HPC workload managers and their careful tuning contribute to the high utilization of HPC clusters. However, due to inevitable uncertainty, it is impossible to completely avoid node idleness. Although such idle slots are usually too short for any HPC job, they are too long to ignore. The Function-as-a-Service (FaaS) paradigm promises to fill this gap and can be a good match, as typical FaaS functions last seconds, not hours. Here we show how to build a FaaS infrastructure on idle nodes in an HPC cluster in such a way that it does not significantly affect the performance of the HPC jobs. We dynamically adapt to a changing set of idle physical machines by integrating the open-source software Slurm and OpenWhisk. We designed and implemented a prototype solution that allowed us to cover up to 90% of the idle time slots on a 50k-core cluster running production workloads.

  • Conference Article
  • Citations: 3
  • 10.1109/qrs-c.2018.00069
Predictability of Resource Intensive Big Data and HPC Jobs in Cloud Data Centres
  • Jul 1, 2018
  • Christopher B Hauser + 2 more


  • Conference Article
  • Citations: 4
  • 10.1109/sc41406.2024.00062
MCBound: An Online Framework to Characterize and Classify Memory/Compute-bound HPC Jobs
  • Nov 17, 2024
  • Francesco Antici + 4 more


  • Book Chapter
  • Citations: 2
  • 10.1007/978-3-031-74430-3_10
Run Your HPC Jobs in Eco-Mode: Revealing the Potential of User-Assisted Power Capping in Supercomputing Systems
  • Dec 21, 2024
  • Luc Angelelli + 2 more


  • Conference Article
  • Citations: 3
  • 10.1109/pdp52278.2021.00014
Job Classification Through Long-Term Log Analysis Towards Power-Aware HPC System Operation
  • Mar 1, 2021
  • Yuichi Tsujita + 4 more

High utilization of HPC system resources under constraints on electric power consumption or I/O workload is one of the primary goals in dealing with high demand from application users. Utilization of CPU and memory, which is tightly related to electric power consumption, is a counterpart metric to I/O activity in most HPC jobs. Towards higher utilization of HPC systems under management restrictions on electric power consumption and I/O activity, care must be taken to avoid hot-spots in power consumption or I/O operations, because such situations lead to unstable system operation by exceeding the capability of the electric power supply or the I/O subsystem. Analysis of a huge volume of log data collected from the K computer has revealed a high correlation between I/O activities and CPU and memory utilization in some specific compute-node layouts, showing unique characteristics of HPC jobs such as computation-intensive or I/O-intensive behaviour. It has turned out that classifying jobs in terms of required electric power divides them into two groups: jobs consuming high electric power and I/O-intensive jobs. We have succeeded in classifying jobs with high correctness using a machine learning approach, and we have confirmed the effectiveness of the classification towards power-aware operation of our next HPC system, the supercomputer Fugaku.

  • Conference Article
  • Citations: 6
  • 10.5555/3014904.3014979
A data driven scheduling approach for power management on HPC systems
  • Nov 13, 2016
  • Sean Wallace + 6 more

Modern schedulers running on HPC systems traditionally consider the number of resources and the time requested for each job when making scheduling decisions. Until recently this has been sufficient; however, as systems get larger, other metrics such as power consumption become necessary to ensure system stability. In this paper, we propose a data-driven scheduling approach for controlling the power consumption of the entire system under any user-defined budget. Here, “data driven” means that our approach actively observes, analyzes, and assesses the power behaviors of the system and user jobs to guide scheduling decisions for power management. This design is based on the key observation that HPC jobs have distinct power profiles. Our work contains an empirical analysis of workload power characteristics on a production system, a dynamic learner to estimate the job power profile for scheduling, and an online power-aware scheduler for managing the overall system power. Using real workload traces, we demonstrate that our design effectively controls system power consumption while minimizing the impact on system utilization.

  • Research Article
  • 10.11578/dc.20220608.1
JobQueue-PG: A Task Queue for Coordinating Varied Tasks Across Multiple HPC Resources and HPC Jobs
  • May 13, 2022
  • OSTI OAI (U.S. Department of Energy Office of Scientific and Technical Information)
  • Charles Tripp + 3 more


  • Conference Article
  • Citations: 14
  • 10.1109/icppw.2014.28
Dynamic Virtual Machine Placement for Cloud Computing Environments
  • Sep 1, 2014
  • Xinying Zheng + 1 more

With the increasing adoption of large-scale cloud computing platforms, how to place virtual machine (VM) requests onto available computing servers to reduce energy consumption has become a hot research subject. However, current VM placement approaches are still not effective for live migrations with dynamic characteristics. In this paper, we propose a dynamic VM placement scheme for energy-efficient resource allocation in a cloud platform. Our dynamic VM placement scheme supports VM request scheduling and live migration to minimize the number of active nodes, in order to save overall energy in a virtualized data center. Specifically, the proposed VM placement scheme is built on a statistical mathematical framework, and it incorporates all the virtualization overheads in the dynamic migration process. In addition, our scheme considers other important factors related to power consumption, and it is ready to be extended with further considerations of user demand. We conduct extensive evaluations based on HPC jobs in a simulated environment. The results prove the effectiveness of our scheme.
