Abstract

The automation of ATLAS Distributed Computing (ADC) operations is essential to reduce manpower costs and to allow performance-enhancing actions that improve the reliability of the system. From this perspective, a crucial case is the automatic handling of outages of the storage resources at ATLAS computing sites, which are continuously exploited at the edge of their capabilities. It is challenging to adopt unambiguous decision criteria for storage resources of non-homogeneous types, sizes and roles. The recently developed Storage Area Automatic Blacklisting (SAAB) tool provides a suitable solution by employing an inference algorithm that processes the history of storage monitoring test outcomes. SAAB both provides global monitoring and performs automatic operations on single sites. The implementation of the SAAB tool was the first step in a comprehensive review of storage area monitoring and central management at all levels. This review involved reordering and optimizing the deployment of SAM tests and including SAAB results in the ATLAS Site Status Board, with both dedicated metrics and views. The resulting structure allows the status of storage resources to be monitored with fine time granularity and automatic actions to be taken in foreseen cases, such as automatic outage handling and notifications to sites. Human actions are therefore restricted to reporting and following up problems, where and when needed. In this work we show the working principles and features of SAAB. We also present the decrease in human interactions achieved within the ATLAS Computing Operation team. The automation results in a prompt reaction to failures, which leads to the optimization of resource exploitation.

Highlights

  • The Large Hadron Collider (LHC) at CERN delivered colliding beams at a centre-of-mass energy of 7 TeV from March 2010 and at a centre-of-mass energy of 8 TeV from April to December 2012

  • Storage Area Automatic Blacklisting (SAAB) production experience: after the first production operations, it is possible to conclude that the SAAB tool's principles and implementation have proved successful in enhancing automated ATLAS Distributed Computing (ADC) activities by introducing automatic management of storage resources based upon their performance

  • A first, conceptual improvement consists in replacing human educated guesses with unambiguous and reproducible decision criteria, which makes performance-based blacklisting uniform across the heterogeneous ADC storage resources


Summary

Introduction

The Large Hadron Collider (LHC) at CERN delivered colliding beams at a centre-of-mass energy of 7 TeV from March 2010 and at a centre-of-mass energy of 8 TeV from April to December 2012. ADC storage resources are operated by the Distributed Data Management (DDM) system [5], which performs operations on and monitoring of the sites' storage elements.

