Striped Data Analysis Framework

Oliver Gutsche, Igor Mandrichenko, C Doglioni, P Jackson, G.A Stewart, D Kim, W Kamleh, L Silvestris

doi:10.1051/epjconf/202024506042

Open Access

https://doi.org/10.1051/epjconf/202024506042

Abstract

A columnar data representation is known to be an efficient way to store data, specifically when an analysis typically uses only a small fragment of the available data structures. A data representation like Apache Parquet is a step forward from a purely columnar representation: it also splits data horizontally to allow for easy parallelization of data analysis. Based on the general idea of columnar data storage, working on the [LDRD Project], we have developed a striped data representation, which, we believe, is better suited to the needs of High Energy Physics data analysis. A traditional columnar approach allows for efficient data analysis of complex structures. While keeping all the benefits of columnar data representations, the striped mechanism goes further by enabling easy parallelization of computations without requiring special hardware. We will present an implementation and some performance characteristics of such a data representation mechanism using a distributed NoSQL database or a local file system, unified under the same API and data representation model. The representation is efficient and at the same time simple, so it allows for a common data model and APIs across a wide range of underlying storage mechanisms, such as distributed NoSQL databases and local file systems. Striped storage adopts NumPy arrays as its basic data representation format, which makes it easy and efficient to use in Python applications. The Striped Data Server is a web service that hides the server implementation details from the end user, easily exposes data to WAN users, and makes it possible to use well-known, mature data caching solutions to further increase data access efficiency. We are considering the Striped Data Server as the core of an enterprise-scale data analysis platform for High Energy Physics and similar areas of data processing.
We have been testing this architecture with a 2 TB dataset from a CMS dark matter search and plan to expand it to hundreds of TB or even the PB scale. We will present the striped format, the Striped Data Server architecture and performance test results.

Summary

Introduction

High Energy Physics (HEP) data analysis is an iterative process of distilling large amounts of data to extract physical information and presenting it in the most useful way. The physicist repeats the process of refining the data filtering, data compilation and information representation many times, so reducing the time of an individual iteration shortens the overall analysis and allows the physicist to deliver results sooner. HEP data is stored in files, and the analysis is essentially a repeated process of running the analysis software over a large set of files. To reduce the iteration time, physicists reduce the initial dataset (set of files) to a smaller one using two methods, one of which is skimming: pre-filtering potentially interesting events so that there are fewer events to process during each run. Once the dataset is reduced, the physicist can run their analysis code over smaller amounts of data more efficiently, but the data reduction step is itself part of the overall analysis process and slows it down. We propose to move from the event-loop style of analysis to vector-based calculations, which, when combined with a functional and/or declarative algorithm representation, can be moved from CPU to GPU or other SIMD processing platforms as the underlying computing fabric.
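The difference between the event-loop style and vector-based calculations can be illustrated with a small NumPy sketch. This is not the framework's API: the column names (`met`, `n_jets`), the cut values and the randomly generated data are all assumptions made for illustration. Both functions apply the same skim-like selection, first event by event and then as a single boolean mask over whole columns.

```python
import numpy as np

# Hypothetical per-event columns; names, units and values are illustrative only.
rng = np.random.default_rng(seed=0)
n_events = 10_000
met = rng.exponential(scale=50.0, size=n_events)     # assumed quantity, in GeV
n_jets = rng.integers(low=0, high=6, size=n_events)  # assumed jets per event


def select_event_loop(met, n_jets, met_cut=100.0, min_jets=2):
    """Event-loop style: inspect one event at a time."""
    selected = []
    for i in range(len(met)):
        if met[i] > met_cut and n_jets[i] >= min_jets:
            selected.append(met[i])
    return np.array(selected)


def select_vectorized(met, n_jets, met_cut=100.0, min_jets=2):
    """Vector style: build one boolean mask over entire columns."""
    mask = (met > met_cut) & (n_jets >= min_jets)
    return met[mask]


loop_result = select_event_loop(met, n_jets)
vector_result = select_vectorized(met, n_jets)
```

Both functions return the same events in the same order. The vectorized form expresses the selection as whole-array operations, which is the kind of formulation that maps naturally onto SIMD hardware.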

Striped Data Representation
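
The abstract describes striped storage as columnar data that is also split horizontally and held as NumPy arrays. A minimal sketch of that idea follows, under stated assumptions: the function name `split_into_stripes`, the `stripe_size` parameter and the column names are hypothetical and are not taken from the framework itself.

```python
import numpy as np


def split_into_stripes(columns, stripe_size):
    """Split equally long column arrays into horizontal stripes.

    Returns a list of dicts, each mapping a column name to the NumPy
    slice of that column that belongs to one stripe.
    """
    lengths = {len(arr) for arr in columns.values()}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same number of events")
    n_events = lengths.pop()
    stripes = []
    for start in range(0, n_events, stripe_size):
        stop = start + stripe_size
        stripes.append({name: arr[start:stop] for name, arr in columns.items()})
    return stripes


# Hypothetical columns for ten events.
columns = {
    "pt": np.arange(10, dtype=np.float64),
    "charge": np.array([1, -1] * 5, dtype=np.int8),
}
stripes = split_into_stripes(columns, stripe_size=4)

# Only the requested column is read, and each stripe can be handled
# independently (for example by separate workers) before combining.
partial_sums = [stripe["pt"].sum() for stripe in stripes]
total = sum(partial_sums)
```

In this sketch an analysis that needs only `pt` never touches `charge`, and the per-stripe partial results are combined at the end, which is what makes the layout easy to parallelize.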

Conclusion