Abstract
As a joint effort from various communities involved in the Worldwide LHC Computing Grid, the Operational Intelligence project aims at increasing the level of automation in computing operations and reducing human interventions. The distributed computing systems currently deployed by the LHC experiments have proven to be mature and capable of meeting the experimental goals by allowing timely delivery of scientific results. However, a substantial number of interventions from software developers, shifters, and operational teams are needed to manage such heterogeneous infrastructures efficiently. Under the scope of the Operational Intelligence project, experts from several areas have gathered to propose and work on “smart” solutions. Machine learning, data mining, log analysis, and anomaly detection are only some of the tools we have evaluated for our use cases. In this community study contribution, we report on the development of a suite of operational intelligence services to cover various use cases: workload management, data management, and site operations.
Highlights
We formed the operational intelligence (OpInt) initiative to increase the level of automation in computing operations and reduce human interventions
We presented an overview of activities, in varying stages of completeness, in computing center operations, workflow management, and data management; these are only three of the areas where innovative approaches that build on state-of-the-art technologies can bring substantial improvement
In compliance with a variety of computer security and data privacy guidelines and policies applicable to our environment, we have been sharing the code developed in the scope of the various operational intelligence initiative projects in a GitHub repository
Summary
We formed the operational intelligence (OpInt) initiative to increase the level of automation in computing operations and reduce human interventions. The Worldwide LHC Computing Grid provides seamless access to computing resources, including data storage capacity, processing power, sensors, and visualization tools; these resources are capable of processing over two million tasks daily, leveraging over one million computer cores and 1 exabyte of storage. Analysis of the operators’ actions is used to automate tasks, such as creating tickets for support centers, or to suggest possible solutions to recurring issues. Some of the efforts that were born out of the discussions are already being used to reduce operational costs.