Abstract

The LHCb experiment stores around 10¹¹ collision events per year. A typical physics analysis deals with a final sample of up to 10⁷ events. Event preselection algorithms (lines) are used for data reduction. Since the data are stored in a format that requires sequential access, the lines are grouped into several output file streams, in order to increase the efficiency of user analysis jobs that read these data. The scheme efficiency heavily depends on the stream composition. By putting similar lines together and balancing the stream sizes it is possible to reduce the overhead. We present a method for finding an optimal stream composition. The method is applied to a part of the LHCb data (Turbo stream) on the stage where it is prepared for user physics analysis. This results in an expected improvement of 15% in the speed of user analysis jobs, and will be applied on data to be recorded in 2017.

Highlights

  • To capture and analyze a large number of collision events, the LHCb experiment [1] relies on a multi-stage data processing pipeline [2]

  • Since the data are stored in a format that requires sequential access, the lines are grouped into several output file streams, in order to increase the efficiency of user analysis jobs that read these data

  • We present a method for finding an optimal stream composition


Introduction

To capture and analyze a large number of collision events, the LHCb experiment [1] relies on a multi-stage data processing pipeline [2]. Data in the FULL stream need to be reconstructed on distributed computing resources and are intended for further event selection before being made available for user analysis. User analysis jobs run independently and usually require only a small subset of all events selected by the lines. The lines are grouped into streams, such that each file available for user analysis corresponds to a particular run–stream pair. For Tesla streams, the estimated time spent by user jobs on disk access differs by a factor of 5 between the extreme variants; these factors must therefore be estimated in order to construct a streaming scheme. There is another constraint: the WLCG often uses tape storage systems, which generally do not cope well with storing and providing frequent access to many small files [6].

An event enters a stream if it passes at least one of the stream's lines, so the expected number of events in stream s is

    N_s = Σ_{e ∈ E} [ 1 − ∏_l (1 − Δ_el · P_l · L_ls) ],

where Δ_el ∈ {0, 1} is the indicator whether event e passes line l, P_l ∈ [0, 1] is the prescale value of line l, and L_ls ∈ {0, 1} is the indicator whether line l belongs to stream s.
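The expected stream sizes can be computed directly from the three quantities defined above. The sketch below is illustrative only (the function name and the toy matrices are not from the paper); it evaluates N_s = Σ_e [1 − ∏_l (1 − Δ_el P_l L_ls)] with numpy broadcasting.

```python
import numpy as np

def expected_stream_sizes(delta, prescale, lines_to_streams):
    """Expected number of events per stream.

    delta            : (n_events, n_lines) 0/1 matrix, Δ_el
    prescale         : (n_lines,) array, P_l in [0, 1]
    lines_to_streams : (n_lines, n_streams) 0/1 matrix, L_ls
    """
    # Probability that event e is kept by line l, and line l sits in stream s.
    keep = delta[:, :, None] * prescale[None, :, None] * lines_to_streams[None, :, :]
    # Event e enters stream s if at least one of that stream's lines keeps it:
    # P_es = 1 - prod_l (1 - Δ_el * P_l * L_ls)
    p_event_in_stream = 1.0 - np.prod(1.0 - keep, axis=1)
    # Sum over events gives the expected size of each stream.
    return p_event_in_stream.sum(axis=0)

# Toy example (hypothetical): 3 events, 2 lines, 2 streams with one line each.
delta = np.array([[1, 0],
                  [1, 1],
                  [0, 1]])
prescale = np.array([1.0, 0.5])
L = np.array([[1, 0],
              [0, 1]])
print(expected_stream_sizes(delta, prescale, L))  # → [2. 1.]
```

Stream 0 collects both events passing line 0 (no prescale), while stream 1 keeps each event passing line 1 only half the time, hence the expected sizes 2 and 1.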

