Abstract
Apache Mesos is a resource management system for large data centres, initially developed at UC Berkeley and now maintained under the umbrella of the Apache Software Foundation. It is widely used in industry by companies such as Apple, Twitter, and Airbnb, and it is known to scale to tens of thousands of nodes. Together with other tools in its ecosystem, such as Mesosphere Marathon or Metronome, it provides an end-to-end solution for data centre operations and a unified way to exploit large distributed systems. We present the experience of the ALICE Experiment Offline & Computing in deploying the Apache Mesos ecosystem and using it in production for a variety of tasks on a small 500-core cluster built from hybrid OpenStack and bare-metal resources. We first introduce the architecture of our setup and its operation, and then describe the tasks it performs, including release building and QA, release validation, and simple Monte Carlo production. We show how we developed Mesos-enabled components (called “Mesos Frameworks”) to meet ALICE-specific needs. In particular, we illustrate our effort to integrate Work Queue, a lightweight batch processing engine developed by the University of Notre Dame, which ALICE uses to orchestrate release validation. Finally, we give an outlook on how to use Mesos as the resource manager for DDS, a software deployment system developed by GSI which will be the foundation of system deployment for the ALICE next-generation Online-Offline (O2).
Highlights
We present the experience of the ALICE Experiment Offline & Computing in deploying the Apache Mesos ecosystem and using it in production for a variety of tasks on a small 500-core cluster built from hybrid OpenStack and bare-metal resources
We give an outlook on how to use Mesos as the resource manager for DDS, a software deployment system developed by GSI which will be the foundation of system deployment for the ALICE next-generation Online-Offline (O2)
The main clusters used for offline software purposes are the Release Validation cluster [1] and the PROOF-based Virtual Analysis Facility [2]: both are virtual and run on resources supplied by the CERN OpenStack instance [3]
Summary
ALICE Mesos cluster setup at CERN: Mesos has a two-level architecture in which a certain number of masters control the infrastructure and the registered frameworks, while a large number of agents, one per worker node, are in charge of deploying, running and garbage-collecting tasks. All three approaches with multiple frameworks can work together on a single set of Mesos resources, as Mesos was designed to orchestrate different use cases at the same time.
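The two-level split described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the Mesos API: the classes `Agent`, `Framework` and `Master`, their methods, and the task names are all hypothetical, and only CPU cores are modelled. The master offers each agent's free cores to the registered frameworks in turn, each framework decides which of its own pending tasks fit, and the agent runs the accepted tasks and returns their cores when they finish.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical agent: one per worker node, runs tasks, tracks free cores."""
    name: str
    free_cpus: int
    running: list = field(default_factory=list)

    def launch(self, task_name, cpus):
        if cpus > self.free_cpus:
            raise ValueError("not enough free cores on " + self.name)
        self.free_cpus -= cpus
        self.running.append((task_name, cpus))

    def finish(self, task_name):
        # Garbage-collect a finished task and give its cores back.
        for entry in list(self.running):
            if entry[0] == task_name:
                self.running.remove(entry)
                self.free_cpus += entry[1]


@dataclass
class Framework:
    """Hypothetical framework: picks which of its tasks to run on an offer."""
    name: str
    pending: list  # (task_name, cpus) tuples waiting to run

    def on_offer(self, agent_name, cpus):
        accepted = []
        for task in list(self.pending):
            if task[1] <= cpus:
                accepted.append(task)
                self.pending.remove(task)
                cpus -= task[1]
        return accepted


class Master:
    """Hypothetical master: knows agents and frameworks, hands out offers."""

    def __init__(self, agents, frameworks):
        self.agents = agents
        self.frameworks = frameworks

    def offer_round(self):
        placements = []
        for agent in self.agents:
            for fw in self.frameworks:
                if agent.free_cpus == 0:
                    break  # nothing left to offer on this node
                for task_name, cpus in fw.on_offer(agent.name, agent.free_cpus):
                    agent.launch(task_name, cpus)
                    placements.append((fw.name, task_name, agent.name))
        return placements


if __name__ == "__main__":
    agents = [Agent("node1", 4), Agent("node2", 2)]
    frameworks = [
        Framework("validation", [("val-1", 2), ("val-2", 2)]),
        Framework("montecarlo", [("mc-1", 2), ("mc-2", 1)]),
    ]
    master = Master(agents, frameworks)
    print(master.offer_round())
    agents[0].finish("val-1")  # freed cores become available again
    print(master.offer_round())
```

In this sketch two unrelated frameworks share the same pool of nodes without knowing about each other, which is the property the text relies on when it says several approaches can coexist on a single set of Mesos resources. Real Mesos offers also carry memory, disk and ports, and apply a fairness policy between frameworks that this model omits.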