Abstract

As a joint effort from various communities involved in the Worldwide LHC Computing Grid, the Operational Intelligence project aims at increasing the level of automation in computing operations and reducing human interventions. The distributed computing systems currently deployed by the LHC experiments have proven to be mature and capable of meeting the experimental goals by allowing timely delivery of scientific results. However, a substantial number of interventions from software developers, shifters, and operational teams are needed to efficiently manage such heterogeneous infrastructures. Under the scope of the Operational Intelligence project, experts from several areas have gathered to propose and work on “smart” solutions. Machine learning, data mining, log analysis, and anomaly detection are only some of the tools we have evaluated for our use cases. In this community study contribution, we report on the development of a suite of operational intelligence services to cover various use cases: workload management, data management, and site operations.

Highlights

  • We formed the operational intelligence (OpInt) initiative to increase the level of automation in computing operations and reduce human interventions

  • We presented an overview of activities, at varying stages of completeness, in computing center operations, workflow management, and data management. These are only three of the areas where innovative approaches can bring substantial improvement while benefiting from state-of-the-art technologies

  • In compliance with a variety of computer security and data privacy guidelines and policies applicable to our environment, we have been sharing the code developed in the scope of the various operational intelligence initiative projects in a GitHub repository

Summary

INTRODUCTION

We formed the operational intelligence (OpInt) initiative to increase the level of automation in computing operations and reduce human interventions. The Worldwide LHC Computing Grid provides seamless access to computing resources, including data storage capacity, processing power, sensors, and visualization tools; these resources can process over two million tasks daily, leveraging over one million computer cores and 1 exabyte of storage. Analysis of the operators’ actions is used to automate tasks such as creating tickets for support centers or suggesting possible solutions to recurring issues. Some of the efforts that were born out of the discussions are already being used to reduce operational costs.

OPINT AREAS OF INTEREST
Monitoring the Computing Infrastructures—Tools and Their Unification
Predictive and Reactive Site Operations
CMS Intelligent Alert System
Workflow Management—Jobs Buster
Error Messages Clustering
FTS Logs, Errors, and Failures Analysis
NLP Applications in Rucio
CERN MONIT Infrastructure
ATLAS Monitoring Infrastructure
CMS Monitoring Infrastructure
OpInt Framework
Anomaly Detection
Predictive Site Maintenance
HammerCloud Job Shaping
CMS Intelligent Alert System
Jobs Buster
Error Messages Clustering
Analysis of FTS Errors and Failures
NLP Applications in Rucio
FUTURE DEVELOPMENTS
CONCLUSION