A study of data representation in Hadoop to optimize data storage and search performance for the ATLAS EventIndex

Z Baranowski, D Barberis, R Toebbicke, J Hrivnac, L Canali

doi:10.1088/1742-6596/898/6/062020

Open Access

https://doi.org/10.1088/1742-6596/898/6/062020

Abstract

This paper reports on activities aimed at improving the architecture and performance of the ATLAS EventIndex implementation in Hadoop. The EventIndex contains tens of billions of event records, each of about 100 bytes, all equally likely to be searched or counted. Data formats are one important area for optimizing the performance and storage footprint of Hadoop-based applications. This work reports on production usage and on tests of several data formats, including Map Files, Apache Parquet and Avro, combined with various compression algorithms. The query engine also plays a critical role in the architecture. We also report on the use of HBase for the EventIndex, focussing on the optimizations performed in production and on the scalability tests. Additional engines that have been tested include Cloudera Impala, in particular for its SQL interface and its optimizations for data warehouse workloads and reports.

Highlights

The ATLAS EventIndex [1] is a metadata catalogue of all real and simulated data produced by the ATLAS experiment [2], one of seven particle detectors constructed for the CERN Large Hadron Collider (LHC).
The ATLAS EventIndex system has to scale to the order of several 10^10 events, be flexible in its schemas to accommodate a variety of stored quantities that could change in the future, use established and possibly open-source technologies, and be "easy" to develop, deploy and operate.
In the ATLAS EventIndex data loaded into HBase, each event attribute was stored in a separate cell, and the row key was composed as a concatenation of event identification attributes.
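
The HBase layout described in the last highlight can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of identification attributes (run_number, event_number), the fixed-width big-endian encoding, the column family name "a" and the sample attribute names are all assumptions made for illustration, and no real HBase client is used.

```python
import struct

# Hypothetical key layout: 4-byte run number followed by 8-byte event number.
# Big-endian fixed-width fields make byte-wise (lexicographic) ordering,
# which is how HBase sorts rows, agree with numeric ordering.
KEY_FORMAT = ">IQ"


def make_row_key(run_number: int, event_number: int) -> bytes:
    """Concatenate the (assumed) event identification attributes into one key."""
    return struct.pack(KEY_FORMAT, run_number, event_number)


def parse_row_key(key: bytes) -> tuple:
    """Recover (run_number, event_number) from a key built by make_row_key."""
    return struct.unpack(KEY_FORMAT, key)


def to_cells(attributes: dict, family: str = "a") -> dict:
    """Map each event attribute to its own cell, keyed by 'family:qualifier'."""
    return {
        f"{family}:{name}".encode(): str(value).encode()
        for name, value in attributes.items()
    }


if __name__ == "__main__":
    key = make_row_key(run_number=284500, event_number=1234567890)
    # Attribute names below are placeholders, not the real EventIndex schema.
    cells = to_cells({"lumi_block": 42, "file_guid": "hypothetical-guid"})
    print(key.hex(), parse_row_key(key), cells)
```

With a key of this shape, all events of one run are stored contiguously, so a lookup by (run, event) is a single-row get and a per-run count is a prefix scan. The trade-off of one cell per attribute is that the row key is repeated in every cell, which increases the storage footprint.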

Summary

The ATLAS EventIndex project

The ATLAS EventIndex [1] is a metadata catalogue of all real and simulated data produced by the ATLAS experiment [2], one of seven particle detectors constructed for the CERN Large Hadron Collider (LHC). It was designed in 2012-2013 and implemented in 2014; the first data (all LHC Run 1 data collected in 2009-2013) were loaded at the beginning of 2015.

System requirements and use cases

Current architecture

Storage implementation

Event Index record content

Data access paths

Limitation of the Core Storage implementation

Evaluation of alternative modern storage approaches for Core Storage

Hardware and storage configuration

Evaluated formats and technologies

Measurement results

Space utilization

Ingestion speed

Random data lookup

Data processing speed

Summary of the evaluation

Hybrid system

Conclusions

Journal: Journal of Physics: Conference Series
Publication Date: Oct 1, 2017
Citations: 5
License type: cc-by
