Abstract

Containers have become the de facto standard for packaging and distributing modern applications and their dependencies. The HEP community shows increasing interest in this technology, with scientists encapsulating their analysis workflow and code inside a container image. The analysis is first validated on a small dataset with minimal hardware resources and then run at scale on the massive computing capacity provided by the grid. The typical approach for distributing containers consists of pulling their image from a remote registry and extracting it on the node where the container runtime (e.g., Docker, Singularity) runs. This approach, however, does not easily scale to large images and thousands of nodes. CVMFS has long been used for the efficient distribution of software directory trees at a global scale. To extend its optimized caching and network utilization to the distribution of containers, CVMFS recently implemented a dedicated container image ingestion service together with container runtime integrations. CVMFS ingestion is based on per-file deduplication, instead of the per-layer deduplication adopted by traditional container registries. On the client side, CVMFS fetches on demand only the chunks required for the execution of the container instead of the whole image.
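To make the per-file deduplication idea concrete, the following Python sketch content-addresses every file of an unpacked image so that a file appearing in many layers or images is stored only once, with a catalog mapping paths to hashes. This is an illustration only, not the actual CVMFS implementation; `ingest_file`, `ingest_image_tree`, and the object-store layout are assumptions made for the example.

```python
import hashlib
import os
import shutil

def ingest_file(src_path: str, store_root: str) -> str:
    """Copy a file into a content-addressed store; identical content is stored once."""
    sha = hashlib.sha1()
    with open(src_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    digest = sha.hexdigest()
    dst = os.path.join(store_root, digest[:2], digest[2:])  # object addressed by hash
    if not os.path.exists(dst):  # content already ingested by another layer/image
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy2(src_path, dst)
    return digest

def ingest_image_tree(unpacked_root: str, store_root: str) -> dict:
    """Walk an unpacked image and build a catalog mapping file paths to content hashes."""
    catalog = {}
    for dirpath, _dirs, files in os.walk(unpacked_root):
        for name in files:
            path = os.path.join(dirpath, name)
            catalog[os.path.relpath(path, unpacked_root)] = ingest_file(path, store_root)
    return catalog
```

Because files are keyed by content hash, pushing a new tag of an image that changes only a few files adds only those files to the store, whereas per-layer deduplication would re-store every file in the modified layers.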

Highlights

  • In recent years, container technologies have seen wide adoption by software developers, system administrators, and IT practitioners to the point of becoming the preferred way to package, distribute, and deploy applications

  • Among the available container runtimes, three are popular in the High Energy Physics (HEP) community: i) Singularity [5] has its roots in the scientific environment and is the most widely used for containerized jobs on the Worldwide LHC Computing Grid (WLCG); ii) containerd [6] implements the Container Runtime Interface (CRI) [7] used by Kubernetes [8] and integrates well with container orchestration tools; iii) Podman [9] can run rootless, is well integrated with the CentOS ecosystem, and provides an interface identical to the one offered by Docker (a usage sketch for running images published on CVMFS follows this list)

  • CernVM File System (CVMFS) is set up to publish to the local SSD disk, while source container images are provided by Docker Hub and by the GitLab Container Registry deployed at CERN
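As a rough illustration of how a runtime consumes images distributed this way (the path under /cvmfs and the helper below are assumptions for the example, not the paper's exact setup), Singularity can execute a container directly from an unpacked directory tree served by CVMFS, so only the files actually accessed are downloaded and cached locally:

```python
import subprocess

# Illustrative path of a flattened image published on CVMFS; repositories such as
# unpacked.cern.ch use a similar <registry>/<image>:<tag> directory scheme.
IMAGE_DIR = "/cvmfs/unpacked.cern.ch/registry.hub.docker.com/library/ubuntu:latest"

def run_in_cvmfs_image(image_dir: str, command: list) -> int:
    """Run a command inside a container whose root filesystem is served by CVMFS."""
    # Singularity accepts an unpacked (sandbox) directory as the container image,
    # so files are fetched on demand by the CVMFS client rather than pulled upfront.
    return subprocess.run(["singularity", "exec", image_dir, *command]).returncode

if __name__ == "__main__":
    run_in_cvmfs_image(IMAGE_DIR, ["cat", "/etc/os-release"])
```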

Summary

Introduction

Container technologies have seen wide adoption by software developers, system administrators, and IT practitioners to the point of becoming the preferred way to package, distribute, and deploy applications. Container images built and used in the HEP environment can reach tens of gigabytes in size and, even if pushed only once to the registry, they can potentially be pulled by thousands of computing nodes that are part of the Worldwide LHC Computing Grid (WLCG) [1]. This puts additional load on both the network infrastructure from the container registry to the computing nodes and the storage capacity of each computing node, given that container images must be downloaded and unpacked into the local filesystem. Distributing the unpacked image content through CVMFS instead allows downloading only the files that are strictly needed for the execution of the container (previous studies [2] confirm our own findings that only a small percentage of the total image volume is used), saving network bandwidth and local storage space. The local cache is self-managed, and files are automatically purged according to a least-recently-used policy.
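The caching behaviour described above can be pictured with a small Python sketch (illustrative only; the real CVMFS client cache is considerably more sophisticated): files are fetched into a size-limited local cache on first access, and the least recently used entries are evicted once the quota is exceeded.

```python
from collections import OrderedDict
from typing import Callable

class LRUFileCache:
    """Minimal least-recently-used file cache (illustrative, not the CVMFS client)."""

    def __init__(self, quota_bytes: int):
        self.quota = quota_bytes
        self.used = 0
        self.entries = OrderedDict()  # maps content digest -> file size in bytes

    def access(self, digest: str, size: int, fetch: Callable[[str], None]) -> None:
        if digest in self.entries:
            self.entries.move_to_end(digest)  # cache hit: mark as most recently used
            return
        fetch(digest)                          # cache miss: download only this file
        self.entries[digest] = size
        self.used += size
        while self.used > self.quota and len(self.entries) > 1:
            _evicted, evicted_size = self.entries.popitem(last=False)  # drop LRU entry
            self.used -= evicted_size
```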

Use of Containers in the HEP community
Use of CVMFS in the HEP community
Server capabilities for container images ingestion
Manage image ingestion in CVMFS
Integration with container runtimes
Evaluation
Ingestion of layers
Ingestion of chains
Characterization of image repositories
Conclusions