Abstract

The current tier-0 processing at CERN is done on two managed sites, the CERN computer centre and the Wigner computer centre. With the proliferation of public cloud resources at increasingly competitive prices, we have been investigating how to transparently extend our compute capacity to include these providers. The approach taken has been to integrate these resources using our existing deployment and computer management tools and to expose them to users as part of the same site. This paper describes the architecture, the toolset and the current production experience of this model.

Highlights

  • The constant requirement for increased compute capacity, concomitant with LHC upgrades and stability, has led to various investigations into how to take advantage of the growth of public cloud resources

  • The current tier-0 processing at CERN is done on two managed sites, the CERN computer centre and the Wigner computer centre

  • Initial investigations [2] into cloud providers focused on integration into specific WLCG workflows, with the VMs provided directly to experiments

Summary

Introduction

The constant requirement for increased compute capacity, concomitant with LHC upgrades and stability, has led to various investigations into how to take advantage of the growth of public cloud resources. It was desirable both to have a common submission framework, treating the extra resources as an extension of the pool, and to be able to identify the cloud resources and specify the jobs that were able (and willing) to run on them. This was achieved using ClassAds, the HTCondor-CE [8] and the Job Router [9]. LHCb uses CEs to delineate sites and, for monitoring to be effective, required a separate CE for the cloud resources. In this implementation, rather than having a CE route according to the job's ClassAd, a dedicated CE decorated all submitted jobs with the correct ClassAd and the requirements needed for jobs to run on the external cloud resources. The Collector will need to use additional file descriptors, tuned with COLLECTOR.MAX_FILE_DESCRIPTORS [10]; a configuration sketch follows.
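As a concrete illustration, below is a minimal HTCondor configuration sketch of this routing. It is not taken from the paper: the route name, the WantExternalCloud/IsExternalCloud attribute names and the file-descriptor count are illustrative assumptions; only JOB_ROUTER_ENTRIES, the set_* route syntax, STARTD_ATTRS and COLLECTOR.MAX_FILE_DESCRIPTORS are standard HTCondor configuration.

    # On the dedicated cloud CE: decorate every incoming job with a
    # ClassAd attribute marking it as cloud-eligible, and constrain it
    # to match only cloud worker nodes.
    # (Attribute and route names are hypothetical.)
    JOB_ROUTER_ENTRIES @=jre
    [
      name = "External_Cloud_Pool";
      TargetUniverse = 5;                          # vanilla universe
      set_WantExternalCloud = true;                # hypothetical marker attribute
      set_Requirements = (TARGET.IsExternalCloud =?= true);
    ]
    @jre

    # On the cloud worker nodes: advertise the matching attribute in the
    # startd ClassAd so the decorated jobs can only match these machines.
    IsExternalCloud = true
    STARTD_ATTRS = $(STARTD_ATTRS) IsExternalCloud

    # On the central manager: with many additional cloud startds reporting
    # in, raise the Collector's file descriptor limit [10].
    # (20000 is an illustrative value, not taken from the paper.)
    COLLECTOR.MAX_FILE_DESCRIPTORS = 20000

With this arrangement the submission framework stays common to all resources: jobs submitted through the dedicated CE acquire the cloud ClassAd automatically, while jobs from the normal CEs remain matched to on-site worker nodes.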

Provisioning HTCondor worker nodes in clouds
Provisioning stack
Puppet and Foreman
Conclusions