Abstract
The Beijing Spectrometer III (BESIII) experiment has produced hundreds of billions of events. The traditional event-wise access model of the BESIII Offline Software System is inefficient for the highly selective, low-rate access typical of a physics analysis. In this paper, an event-based data management system (EventDB) is introduced, which alleviates the problems of low data-processing efficiency and low resource utilization. First, an indexing system based on a NoSQL database is designed. Specified attributes of each event are extracted, and the events of interest to physicists are selected and stored in the database, while the full event data remain in ROOT files. For hot events, the full event data can also be cached in EventDB to improve access performance. The data analysis workflow of HEP experiments needs to change when EventDB is applied: the analysis program first queries the corresponding event index from the database, then reads the event data from the database if the event is cached, or from the ROOT files otherwise. Finally, a test on more than one hundred billion physics events shows that the query speed is greatly improved over traditional file-based data management systems.
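The access pattern described above (query the index, serve hot events from the database, fall back to ROOT files for the rest) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the index database and cache are simulated with in-memory dictionaries, and `read_from_root` is a hypothetical stand-in for real ROOT I/O against a NoSQL-backed index.

```python
# Sketch of the EventDB access pattern: index lookup, cache-first read,
# ROOT-file fallback. All data stores here are in-memory stand-ins.

# Event index: event_id -> selection attributes plus the location
# (file name, entry number) of the full event inside a ROOT file.
event_index = {
    1001: {"n_tracks": 4, "total_energy": 3.1, "file": "run42.root", "entry": 17},
    1002: {"n_tracks": 2, "total_energy": 1.0, "file": "run42.root", "entry": 18},
}

# Cache of "hot" events whose full payload is stored in EventDB itself.
event_cache = {1001: b"<full event payload>"}

def read_from_root(file_name, entry):
    """Hypothetical stand-in for reading one event from a ROOT file."""
    return f"{file_name}:{entry}".encode()

def select_events(predicate):
    """Query the index with a selection predicate; return full event data,
    preferring the cache over file access."""
    results = {}
    for event_id, attrs in event_index.items():
        if not predicate(attrs):
            continue
        if event_id in event_cache:
            # Hot event: served directly from EventDB.
            results[event_id] = event_cache[event_id]
        else:
            # Cold event: fall back to the ROOT file named in the index.
            results[event_id] = read_from_root(attrs["file"], attrs["entry"])
    return results

# Example analysis query: all events with at least two tracks.
selected = select_events(lambda a: a["n_tracks"] >= 2)
```

The key design point this illustrates is that the analysis code filters on pre-extracted index attributes without touching event files at all; file reads happen only for selected events that miss the cache.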
Highlights
As the scale of high-energy physics (HEP) experiments continues to expand, more and more data is produced
The EventDB system described in this paper focuses on the event index and pre-selection, and does not change the BESIII storage and analysis model
We have set up a test bed composed of two sites, Beijing and Chengdu, to evaluate the performance of the EventDB system
Summary
As the scale of high-energy physics (HEP) experiments continues to expand, more and more data is produced. Most HEP experiment data are managed at the granularity of files, each containing many events. File-based data management is facing many challenges with the rapid growth of experiment data and the emergence of new technologies. For example, if a site does not have sufficient storage space or network bandwidth, it is difficult to run data analysis tasks that need a large amount of input data; in that case, only the subset of data actually needed should be transferred on demand.