Preparing Distributed Computing Operations for the HL-LHC Era With Operational Intelligence.

A Di Girolamo, Nikodemas Tuckus, Vasilis Mageirakos, Luca Clissa, Simone Rossi Tisbeni, Panos Paparrigopoulos, Maria Grigorieva, Siarhei Padolski, Valentin Kuznetsov, Matteo Paltenghi, Micol Olocco, Leticia Decker, T Diotalevi, M Lassnig, J Schovancová, F Legger, D Bonacorsi, L Giommi, T A Beermann, Mayank Mohan Sharma, M Boehler, S Jézéquel, D Giordano, L Rinaldi, T Javůrek, D Höhn

doi:10.3389/fdata.2021.753409

Abstract

As a joint effort from various communities involved in the Worldwide LHC Computing Grid, the Operational Intelligence project aims at increasing the level of automation in computing operations and reducing human interventions. The distributed computing systems currently deployed by the LHC experiments have proven to be mature and capable of meeting the experimental goals, by allowing timely delivery of scientific results. However, a substantial number of interventions from software developers, shifters, and operational teams are needed to efficiently manage such heterogeneous infrastructures. Under the scope of the Operational Intelligence project, experts from several areas have gathered to propose and work on “smart” solutions. Machine learning, data mining, log analysis, and anomaly detection are only some of the tools we have evaluated for our use cases. In this community study contribution, we report on the development of a suite of operational intelligence services to cover various use cases: workload management, data management, and site operations.

Highlights

We formed the operational intelligence (OpInt) initiative to increase the level of automation in computing operations and reduce human interventions.
We presented an overview of activities, at varying stages of completeness, in computing center operations, workflow management, and data management. These are only three of the areas where innovative approaches can bring substantial improvement while benefiting from state-of-the-art technologies.
In compliance with the computer security and data privacy guidelines and policies applicable to our environment, we have been sharing the code developed within the various operational intelligence projects in a GitHub repository.

Summary

INTRODUCTION

We formed the operational intelligence (OpInt) initiative to increase the level of automation in computing operations and reduce human interventions. The Worldwide LHC Computing Grid provides seamless access to computing resources, including data storage capacity, processing power, sensors, and visualization tools. Together these resources can process over two million tasks daily, leveraging over one million computer cores and 1 exabyte of storage. Analysis of the operators’ actions is used to automate tasks, such as creating tickets for support centers, or to suggest possible solutions to recurring issues. Some of the efforts born out of the discussions are already being used to reduce operational costs.

OPINT AREAS OF INTEREST

Monitoring the Computing Infrastructures—Tools and Their Unification

Predictive and Reactive Site Operations

CMS Intelligent Alert System

Workflow Management—Jobs Buster

Error Messages Clustering

FTS Logs, Errors, and Failures Analysis

NLP Applications in Rucio

CERN MONIT Infrastructure

2.10 ATLAS Monitoring Infrastructure

2.11 CMS Monitoring Infrastructure

2.12 OpInt Framework

2.13 Anomaly Detection

2.14 Predictive Site Maintenance

2.15 HammerCloud Job Shaping

2.16 CMS Intelligent Alert System

2.17 Jobs Buster

2.18 Error Messages Clustering

2.19 Analysis of FTS Errors and Failures

2.20 NLP Applications in Rucio

FUTURE DEVELOPMENTS

CONCLUSION


Journal: Frontiers in Big Data
Publication Date: Jan 7, 2022
Citations: 1
License type: CC BY 4.0


