Abstract

CERN uses the world’s largest scientific computing grid, the WLCG, for distributed data storage and processing. Monitoring of the CPU and storage resources is essential to detect operational issues in its systems, for example in the storage elements, and to ensure their proper and efficient function. The processing of experiment data depends strongly on the quality of data access as well as on data integrity, and both of these key parameters must be assured for the lifetime of the data. Given the substantial amount of data, O(200 PB), already collected by ALICE and kept at various storage elements around the globe, scanning every single data chunk would be a very expensive process, both in terms of computing resource usage and execution time. In this paper, we describe a distributed file crawler that addresses these natural limits: it periodically extracts and analyzes statistically significant samples of files from the storage elements, evaluates the results, and is integrated with the existing monitoring solution, MonALISA.
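The sampling approach can be illustrated with a standard finite-population sample-size formula (a minimal sketch; the confidence level, margin of error, and estimator shown here are assumptions for illustration, not values taken from the paper):

```python
import math

def sample_size(population: int, z: float = 1.96,
                margin_of_error: float = 0.05, p: float = 0.5) -> int:
    """Cochran's sample-size formula with finite-population correction.

    population      -- number of files registered on the storage element
    z               -- z-score for the desired confidence level (1.96 ~ 95%)
    margin_of_error -- acceptable error on the estimated corruption rate
    p               -- assumed corruption proportion (0.5 maximizes the sample)
    """
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

# Even for a very large SE the required sample stays small:
print(sample_size(10_000_000))  # -> 385
```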

Highlights

  • ALICE [1] stands for “A Large Ion Collider Experiment” and is one of the four large experiments at the Large Hadron Collider (LHC) at CERN, the European Organization for Nuclear Research

  • To meet the processing and storage requirements, which amount to approximately 150k CPU cores and 200 PB of storage, ALICE uses the WLCG [2] distributed Grid

  • In order to detect corrupted files and analyze the health and performance of storage elements (SEs), we have developed a file crawler, which periodically submits Grid jobs targeted at the computing element(s) closest to the analyzed SE (see the sketch below)
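As a rough illustration of this scheduling idea, the sketch below submits one crawler job per storage element to its closest computing element (the helper functions are hypothetical placeholders, not the actual AliEn/JAliEn API; the SE names, sample size, and daily period are assumptions):

```python
import time

# Hypothetical placeholders for the Grid middleware calls; the real crawler
# talks to the ALICE Grid services, whose API is not shown here.
def list_storage_elements():
    return ["ALICE::CERN::EOS", "ALICE::FZK::SE"]  # illustrative SE names

def closest_computing_element(se):
    return se.rsplit("::", 1)[0]  # assume the CE shares the site prefix

def submit_crawler_job(ce, se, sample_size):
    print(f"submitting crawler job to {ce} for {se}, sample={sample_size}")

CRAWL_PERIOD_SECONDS = 24 * 3600  # assumed: one crawling cycle per day

def crawl_cycle(sample_size=1000):  # sample size here is illustrative
    """Submit one crawler job per storage element, targeted at the
    computing element closest to that SE."""
    for se in list_storage_elements():
        submit_crawler_job(closest_computing_element(se), se, sample_size)

if __name__ == "__main__":
    crawl_cycle()
```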

Summary

Introduction

ALICE [1] stands for “A Large Ion Collider Experiment” and is one of the four large experiments at the Large Hadron Collider (LHC) at CERN, the European Organization for Nuclear Research. We describe a distributed file crawler that accesses data on a time-cyclic schedule with a quasi-random pattern. It gathers statistics such as the number of corrupted or inaccessible files, as well as the throughput and download latency of individual storage elements. A file is considered corrupted in two basic cases: when its MD5 checksum or its apparent size differs from the value stored in the ALICE Grid catalogue. Since a single corrupted data file in an analysis workflow can cause the loss of results from many other files processed by the same job, discarding the affected file improves the overall operating efficiency. This is important to the experiment because it ensures continued high availability of the data sets.
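A minimal sketch of this corruption check is shown below; the catalogue values are represented by a hypothetical dictionary, whereas the real crawler obtains them from the ALICE Grid catalogue:

```python
import hashlib
import os

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a downloaded file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def is_corrupted(path, catalogue_entry):
    """Return True if the downloaded replica disagrees with the catalogue.

    catalogue_entry is a hypothetical dict carrying the 'size' and 'md5'
    values registered for this file in the ALICE Grid catalogue.
    """
    if os.path.getsize(path) != catalogue_entry["size"]:
        return True  # apparent size mismatch
    if md5_of_file(path) != catalogue_entry["md5"]:
        return True  # MD5 checksum mismatch
    return False
```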

Related elements of the ALICE Grid software
Architecture of the file crawler system
Crawler timestamps and execution steps
Cleanup
Crawling prepare
Crawling process
Merging
Database update
Implementation details
Sample size calculation
Data gathered by the crawler
Status codes overview
Status codes analysis
Throughput analysis
PFN sample analysis
Findings
Conclusion
