Abstract

The CERN Batch Service faces many challenges in preparing for the computing demands of future LHC runs. These challenges require that we look at all potential resources, assess how efficiently we use them, and explore alternatives for exploiting opportunistic resources both within our infrastructure and outside the CERN computing centre. Several projects, such as BEER, Helix Nebula Science Cloud and the new OCRE project, have proven our ability to run batch workloads on a wide range of non-traditional resources. However, the challenge is not only to obtain the raw compute resources needed but also to define an operational model that is cost- and time-efficient, scalable, and flexible enough to adapt to a heterogeneous infrastructure. To tackle both the provisioning and the operational challenges, we decided to use Kubernetes. With Kubernetes we benefit from a de facto standard in containerised environments, available in nearly all cloud providers and surrounded by a vibrant ecosystem of open-source projects. Leveraging Kubernetes’ built-in functionality, together with other open-source tools such as Helm, Terraform and GitLab CI, we have deployed a first cluster prototype, which we discuss in detail. The effort has simplified many of our existing operational procedures, but it has also made us rethink established procedures and assumptions that were only valid in a VM-based cloud environment. This contribution presents how we have adopted Kubernetes in the CERN Batch Service, the impact of its adoption on daily operations, a comparison of resource usage efficiency, and our experience so far evolving our infrastructure towards this model.
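
To make the target model concrete, the sketch below uses the official Kubernetes Python client to submit a simple batch-style Job to a cluster. It is a minimal illustration only: the namespace, container image and resource requests are hypothetical placeholders, not values from our deployment.

from kubernetes import client, config

def submit_batch_job() -> None:
    # Load credentials from the local kubeconfig; inside a cluster one
    # would use config.load_incluster_config() instead.
    config.load_kube_config()

    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name="batch-smoke-test"),
        spec=client.V1JobSpec(
            backoff_limit=0,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[
                        client.V1Container(
                            name="payload",
                            # Hypothetical image name, not from the paper.
                            image="registry.example.org/batch/worker:latest",
                            command=["/bin/sh", "-c",
                                     "echo hello from a batch pod"],
                            resources=client.V1ResourceRequirements(
                                requests={"cpu": "1", "memory": "2Gi"},
                            ),
                        )
                    ],
                )
            ),
        ),
    )
    # Create the Job; the cluster then schedules the pod like any workload.
    client.BatchV1Api().create_namespaced_job(namespace="batch", body=job)

if __name__ == "__main__":
    submit_batch_job()

Once the Job is created, scheduling, retries and cleanup are handled by the cluster itself, which is precisely the built-in functionality the abstract refers to.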

Highlights

  • The Batch Service, part of CERN’s IT department, is responsible for providing Tier-0 compute power to the Worldwide LHC Computing Grid (WLCG)

  • Terraform manifests are stored in GitLab and applied from GitLab Continuous Integration/Continuous Deployment (CI/CD) whenever the manifests change (see the pipeline sketch after this list)

  • In one particular case, LHCb GEN-SIM, it was possible to run the benchmark without relying on the image with cached data and instead access the CernVM File System (CVMFS) directly to get the required data (see the pod sketch after this list)
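
The pipeline mentioned in the second highlight is defined in GitLab CI configuration; as a rough illustration of its logic only, the Python sketch below drives Terraform the way such a CI job could: plan with -detailed-exitcode and apply the saved plan file only when changes are pending.

import subprocess
import sys

def run(cmd: list[str]) -> int:
    """Run a command, echoing it first, and return its exit code."""
    print("+", " ".join(cmd))
    return subprocess.call(cmd)

def main() -> None:
    # Initialise providers and the remote state backend.
    if run(["terraform", "init", "-input=false"]) != 0:
        sys.exit("terraform init failed")

    # With -detailed-exitcode, terraform plan returns 0 when there are no
    # changes, 1 on error and 2 when changes are pending.
    rc = run(["terraform", "plan", "-input=false",
              "-out=plan.tfplan", "-detailed-exitcode"])
    if rc == 0:
        print("No infrastructure changes detected; nothing to apply.")
        return
    if rc != 2:
        sys.exit("terraform plan failed")

    # Apply exactly the plan file produced above.
    if run(["terraform", "apply", "-input=false", "plan.tfplan"]) != 0:
        sys.exit("terraform apply failed")

if __name__ == "__main__":
    main()

Applying only a previously saved plan file keeps the change that runs in CI identical to the change that was reviewed when the manifests were merged.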
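
For the third highlight, the hedged sketch below pictures how a pod might read from CVMFS directly instead of an image with cached data, again using the Kubernetes Python client. Mounting /cvmfs from the host assumes a node-level CVMFS client (for example via autofs or the CVMFS CSI driver); the image name and benchmark entry point are hypothetical.

from kubernetes import client, config

def cvmfs_benchmark_pod() -> client.V1Pod:
    # Expose the host's /cvmfs tree inside the container.
    volume = client.V1Volume(
        name="cvmfs",
        host_path=client.V1HostPathVolumeSource(path="/cvmfs"),
    )
    container = client.V1Container(
        name="lhcb-gen-sim",
        # Hypothetical image and benchmark script, not from the paper.
        image="registry.example.org/batch/benchmark:latest",
        command=["/bin/sh", "-c",
                 "ls /cvmfs/lhcb.cern.ch && ./run-benchmark.sh"],
        volume_mounts=[
            client.V1VolumeMount(
                name="cvmfs",
                mount_path="/cvmfs",
                # Propagate mounts that appear on the host (autofs) into
                # the container after it has started.
                mount_propagation="HostToContainer",
            )
        ],
    )
    return client.V1Pod(
        api_version="v1",
        kind="Pod",
        metadata=client.V1ObjectMeta(name="cvmfs-benchmark"),
        spec=client.V1PodSpec(restart_policy="Never",
                              containers=[container],
                              volumes=[volume]),
    )

if __name__ == "__main__":
    config.load_kube_config()
    client.CoreV1Api().create_namespaced_pod(namespace="batch",
                                             body=cvmfs_benchmark_pod())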

Introduction

The Batch Service, part of CERN’s IT department, is responsible for providing Tier-0 compute power to the Worldwide LHC Computing Grid (WLCG). Using HTCondor [1] as the job scheduling system, the service currently offers more than 200K cores of compute power to 500 monthly unique users. These resources are provisioned as 20K virtual machines, belonging to more than 40 OpenStack projects and located in multiple data centres. This forms a heterogeneous pool with varying configurations: hardware, operating systems, kernel versions, virtualisation technologies and physical location. When this infrastructure was first designed, the primary goal was to redesign the toolkit used in the CERN Computer Centre, benefiting from open-source technologies as well as adopting cloud technologies. The outcome of that effort was the Agile Infrastructure project, the ecosystem of tools and procedures that has been the basis of efficient IT operations ever since. This article describes the details of the Kubernetes cluster prototype that takes this infrastructure a step further.
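
As a minimal illustration of how work enters such an HTCondor pool, the sketch below submits a job through the HTCondor Python bindings; the executable and resource requests are placeholder values rather than CERN pool defaults.

import htcondor

def submit_sleep_job() -> None:
    # Describe a minimal job, equivalent to a short submit file; all
    # values here are illustrative placeholders.
    submit = htcondor.Submit({
        "executable": "/bin/sleep",
        "arguments": "60",
        "request_cpus": "1",
        "request_memory": "2GB",
        "output": "job.out",
        "error": "job.err",
        "log": "job.log",
    })
    schedd = htcondor.Schedd()      # local scheduler daemon
    result = schedd.submit(submit)  # queue one job
    print("submitted cluster", result.cluster())

if __name__ == "__main__":
    submit_sleep_job()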

Prototype
Operations
Bootstrap
Node discovery and authentication
Upgrades
Benchmarking
Fine-tuning the deployment
Results
Conclusions