Abstract

During three years of LHC data taking, the ATLAS collaboration completed three petascale data reprocessing campaigns on the Grid, with up to 2 PB of data being reprocessed every year. In reprocessing on the Grid, failures can occur for a variety of reasons, and Grid heterogeneity makes failures hard to diagnose and repair quickly. As a result, Big Data processing on the Grid must tolerate a continuous stream of failures, errors and faults. While ATLAS fault-tolerance mechanisms improve the reliability of Big Data processing on the Grid, their benefits come at a cost and introduce delays that make performance prediction difficult. Reliability Engineering provides a framework for a fundamental understanding of Big Data processing on the Grid; at this scale it is not a desirable enhancement but a necessary requirement. In ATLAS, cost monitoring and performance prediction became critical for the success of the reprocessing campaigns conducted in preparation for the major physics conferences. In addition, our Reliability Engineering approach supported continuous improvements in data reprocessing throughput during LHC data taking: the throughput doubled in the 2011 vs. 2010 reprocessing and quadrupled in the 2012 vs. 2011 reprocessing. We present the Reliability Engineering analysis of ATLAS data reprocessing campaigns, providing the foundation needed to scale Big Data processing technologies beyond the petascale.
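
The outline below lists a section on Waloddi Weibull, so a Weibull failure model is a natural reading of the Reliability Engineering framework invoked here. As an illustration only (the paper's exact formulation may differ), the standard Weibull reliability and hazard functions are:

    % Standard Weibull reliability model, shown for illustration; eta (scale) and
    % beta (shape) are generic parameters, not values quoted by the paper.
    % R(t) is the probability that a job or task survives to time t without failure;
    % h(t) is the corresponding failure (hazard) rate.
    \[
      R(t) = \exp\!\left[-\left(\frac{t}{\eta}\right)^{\beta}\right],
      \qquad
      h(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1}.
    \]
    % beta < 1: failure rate decreases with time (transient, "infant-mortality" failures);
    % beta = 1: constant failure rate (exponential special case);
    % beta > 1: failure rate increases with time (wear-out).

Fitting beta and eta to observed failure times is one standard way to turn failure monitoring into the kind of performance prediction the abstract describes.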

Highlights

  • We present the Reliability Engineering analysis of ATLAS data reprocessing campaigns providing the foundation needed to scale up data reprocessing beyond petascale

  • Reliability Engineering provides a framework for fundamental understanding of data reprocessing performance

  • Distribution of tasks ranked by CPU time used to recover from transient failures is not uniform (see the sketch after this list)
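
A minimal sketch of what such a ranking looks like, assuming hypothetical task names and recovery CPU-hour figures (none of the numbers below come from the ATLAS campaigns): tasks are sorted by the CPU time spent recovering from transient failures and each task's cumulative share of the total cost is reported.

    # Minimal sketch with hypothetical task names and CPU-hour figures
    # (not ATLAS data): rank tasks by the CPU time spent recovering from
    # transient failures and report each task's cumulative share of the total.

    recovery_cpu_hours = {
        "task_A": 5200.0,
        "task_B": 1800.0,
        "task_C": 430.0,
        "task_D": 120.0,
        "task_E": 45.0,
        "task_F": 12.0,
    }

    total = sum(recovery_cpu_hours.values())
    ranked = sorted(recovery_cpu_hours.items(), key=lambda kv: kv[1], reverse=True)

    cumulative = 0.0
    for rank, (task, cpu) in enumerate(ranked, start=1):
        cumulative += cpu
        print(f"{rank:2d}. {task}: {cpu:7.1f} CPU-h, "
              f"{100.0 * cumulative / total:5.1f}% cumulative")

With a skewed distribution like this, the top one or two tasks account for most of the recovery cost, which is why such a ranking points directly at where optimization effort pays off.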

Summary

Introduction

• Scheduled LHC upgrades will increase the data taking rates tenfold, which increases demands on computing resources.
  – A tenfold increase in WLCG resources for LHC upgrade needs is not an option.
• The ATLAS experiment needs to exercise due diligence in evolving its Computing Model to optimally use the required resources.
  – A comprehensive end-to-end solution for the composition and execution of the reprocessing workflow within given CPU and storage constraints is necessary.
• During three years of LHC data taking, the ATLAS collaboration completed three petascale data reprocessing campaigns on the Grid, with up to 2 PB of “raw” data being reprocessed every year.
  – We present the Reliability Engineering analysis of ATLAS data reprocessing campaigns, providing the foundation needed to scale up data reprocessing beyond petascale.

Reliability Engineering on the Grid
Waloddi Weibull
Reprocessing campaign
Duration of Reprocessing Campaigns
Conclusions