Abstract

The CMS computing infrastructure is composed of several subsystems that accomplish complex tasks such as workload and data management, transfers, and the submission of user and centrally managed production requests. Until recently, most subsystems were monitored through custom tools and web applications, and logging information was scattered over several sources and typically accessible only by experts. In the last year, CMS computing fostered the adoption of common big data solutions based on open-source, scalable, NoSQL tools, such as Hadoop, InfluxDB, and ElasticSearch, available through the CERN IT infrastructure. Such systems allow for the easy deployment of monitoring and accounting applications using visualisation tools such as Kibana and Grafana. Alarms can be raised when anomalous conditions in the monitoring data are met, and the relevant teams are automatically notified. Data sources from different subsystems are used to build complex workflows and predictive analytics (such as data popularity, smart caching, transfer latency), and for performance studies. We describe the full software architecture and data flow, the CMS computing data sources and monitoring applications, and show how the stored data can be used to gain insights into the various subsystems by exploiting scalable solutions based on Spark.
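As an illustration of the kind of Spark-based analysis mentioned above, the following is a minimal PySpark sketch that aggregates job-monitoring records stored on HDFS; the HDFS path and the field names ("site", "exit_code") are illustrative assumptions, not the actual CMS schema.

```python
# Minimal sketch: aggregate hypothetical job-monitoring records with Spark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cms-monit-example").getOrCreate()

# Read one day of (hypothetical) job records stored as JSON on HDFS;
# the path is a placeholder, not the real CMS monitoring layout.
jobs = spark.read.json("hdfs:///project/monitoring/cms/jobs/2018/01/01")

# Count failed jobs per site, a typical accounting-style aggregation.
failures = (
    jobs.filter(F.col("exit_code") != 0)
        .groupBy("site")
        .count()
        .orderBy(F.desc("count"))
)
failures.show(20)

spark.stop()
```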

Highlights

  • The CMS experiment [1] at the Large Hadron Collider (LHC) exploits a tiered distributed computing infrastructure to process LHC data and produce Monte Carlo simulated events of relevant physics processes

  • We describe the full software architecture and data flow, the CMS computing data sources and monitoring applications, and show how the stored data can be used to gain insights into the various subsystems by exploiting scalable solutions based on Spark

  • The CMS offline and software computing team ported several of its monitoring applications from custom solutions to open source products such as ElasticSearch, InfluxDB and HDFS


Summary

Introduction

The CMS experiment [1] at the Large Hadron Collider (LHC) exploits a tiered distributed computing infrastructure to process LHC data and produce Monte Carlo simulated events of relevant physics processes. The main components are PhEDEx, the data transfer and location system; the Data Bookkeeping Service (DBS), a metadata catalog; and the Data Aggregation Service (DAS), designed to aggregate views and provide them to users and services [6]. Data from these services are available to CMS collaborators through a suite of web services known as CMSWEB. The various computing services were monitored through custom tools and web applications, developed partly within the CMS computing community and partly by the CERN IT department (monitoring the execution of jobs, data transfers, and site availability). Several solutions are available on the market to gather, store, and process large amounts of data such as those produced by the monitoring and logging services of computing applications. We describe MONIT, the monitoring infrastructure at CERN, which is based on the open-source technologies listed above; the organisation of CMS monitoring applications based on MONIT; and future developments.
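As a sketch of how a subsystem might feed a monitoring document into an HTTP-based infrastructure such as MONIT, the snippet below posts a JSON record to an entry point; the endpoint URL, producer name, document type, and payload fields are illustrative assumptions, not the actual CMS configuration.

```python
# Minimal sketch: inject one monitoring document via JSON over HTTP.
import json
import time
import requests

doc = {
    "producer": "cms",          # hypothetical producer name
    "type": "example_metric",   # hypothetical document type
    "timestamp": int(time.time() * 1000),
    "payload": {"site": "T2_XX_Example", "running_jobs": 1234},
}

resp = requests.post(
    "https://monit-metrics.example.cern.ch:10012/",  # placeholder endpoint
    data=json.dumps([doc]),
    headers={"Content-Type": "application/json"},
    timeout=10,
)
resp.raise_for_status()
```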

The monitoring infrastructure at CERN
Organisation of CMS monitoring
Migration to the MONIT infrastructure
Current developments
Conclusion
