Abstract

The HEP group at the University of Victoria operates a distributed cloud computing system for the ATLAS and Belle II experiments. The system uses private and commercial clouds in North America and Europe that run OpenStack, OpenNebula or commercial cloud software. It is critical that we record accounting information to give credit to cloud owners and to verify our use of commercial resources. We want to record the number of CPU-hours used by each virtual machine (VM). We continuously collect the CPU usage and an estimate of the HEPSpec06 units of the VM, obtained during the boot of the VM, and upload them into an Elasticsearch database. The information is processed and published as soon as it is available. The data is published in tables and plots in Kibana and, as a cross-check, in ROOT. We have found the system useful beyond gathering accounting information; it can also be used for monitoring and diagnostic purposes. For example, we can use it to detect if the payload jobs are stuck in a waiting state for external information. We will report on the design and performance of the system, and show how it provides important accounting and monitoring information on a large distributed system.
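The accounting record described above combines the VM's accumulated CPU time with the HEPSpec06 benchmark estimate taken at boot. A minimal sketch of how such a per-VM document could be built and sent to Elasticsearch is shown below; the field names, index name, and `make_record` helper are illustrative assumptions, not the authors' actual schema.

```python
# Hypothetical sketch of a per-VM accounting record for Elasticsearch.
# Field and index names are assumptions, not the authors' schema.
import time


def make_record(vm_id, cloud, cpu_seconds, hs06_per_core, cores):
    """Build one accounting document for a VM.

    cpu_seconds:   total CPU time consumed by the VM so far
    hs06_per_core: HEPSpec06 estimate measured during VM boot
    cores:         number of cores of the VM
    """
    cpu_hours = cpu_seconds / 3600.0
    return {
        "vm_id": vm_id,
        "cloud": cloud,
        "cpu_hours": cpu_hours,
        # Benchmark-normalised work delivered by this VM.
        "hs06_hours": cpu_hours * hs06_per_core * cores,
        "timestamp": int(time.time()),
    }


# Uploading with the official client (requires a reachable cluster):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# es.index(index="cloud-accounting",
#          document=make_record("vm-001", "cc-east", 7200, 12.5, 4))
```

Continuously re-indexing such documents as the VM runs is what allows tables and plots in Kibana to be updated as soon as the data arrives.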

Highlights

  • The research group for computing in high-energy physics at the University of Victoria runs workloads for two experiments: the ATLAS experiment at the Large Hadron Collider at CERN in Geneva, Switzerland, and the Belle II experiment at the SuperKEKB accelerator at KEK in Tsukuba, Japan

  • The second part of this paper describes the framework we have set up to accurately and promptly collect accounting information about the resources we utilize on all clouds

  • The distribution of the number of jobs at the 10 largest sites is plotted. This illustrates how our site compares to other sites, since we typically provide between a quarter and a third of the overall computing resources for the Belle II experiment


Summary

Introduction

The research group for computing in high-energy physics at the University of Victoria runs workloads for two experiments: the ATLAS experiment at the Large Hadron Collider at CERN in Geneva, Switzerland, and the Belle II experiment at the SuperKEKB accelerator at KEK in Tsukuba, Japan. These workloads are run on distributed clouds all over the world. We use commercial clouds based on Azure, Amazon and Google. Most of these clouds belong to other institutes; for proper accounting of the delivered CPU time to institutes, we need to accurately separate the CPU resources used on each of these clouds.

CloudScheduler
Accounting Framework
Quasi-online Monitoring
Quasi-online Job Monitoring in ATLAS
Quasi-online Job Monitoring in Belle II
How to transfer Secrets onto VMs
Findings
Conclusions