Abstract

Data generation rates are expected to grow very fast for some database workloads going into LHC Run 2 and beyond; in particular this is expected for data coming from controls, logging and monitoring systems. Storing, administering and accessing big data sets in a relational database system can quickly become a very hard technical challenge as the size of the active data set and the number of concurrent users increase. Scale-out database technologies are a rapidly developing set of solutions for deploying and managing very large data warehouses on commodity hardware and with open-source software. In this paper we describe the architecture of, and tests on, database systems based on Hadoop and the Cloudera Impala engine. We discuss the results of our tests, including tests of data loading and of integration with existing data sources, in particular relational databases. We report on query performance tests done with various data sets of interest at CERN, notably data from the accelerator log database.
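
The data-loading and integration tests mentioned above can be pictured with a small, hedged sketch: pulling a table from a relational database into Hadoop through Spark's JDBC data source and storing it as Parquet on HDFS. The connection URL, credentials, table name, key column and target path below are placeholders for illustration, not the setup used in the tests.

```python
# Hedged sketch: load a relational table into HDFS with Spark's JDBC source.
# URL, credentials, table, key column and paths are placeholders; a suitable
# JDBC driver must be available on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdbms-to-hdfs-sketch").getOrCreate()

# Read the source table in parallel, splitting the scan on a numeric key column.
source = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//db-host.example.org:1521/service")
          .option("dbtable", "LOGGING_DATA")            # hypothetical table
          .option("user", "reader")
          .option("password", "***")
          .option("partitionColumn", "VARIABLE_ID")     # numeric split key
          .option("lowerBound", "0")
          .option("upperBound", "1000000")
          .option("numPartitions", "16")
          .load())

# Store the result as Parquet on HDFS so Impala or Spark can query it later.
source.write.mode("overwrite").parquet("hdfs:///logging/imported/logging_data")
```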

Highlights

  • CERN, and high energy physics in general, have developed over the years many techniques to store and manage large amounts of data

  • Unlike physics data, which are stored on tape and on disk-based file systems, controls data coming from LHC subsystems are inserted into relational databases

  • Cloudera Impala is an SQL query engine running on top of the Hadoop Distributed File System (HDFS) that has demonstrated very good scalability in our tests. Spark is another engine that can be used on top of Hadoop and that has also shown very good scalability for our use cases (a short sketch of using both engines follows this list)
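
As an illustration of the architecture mentioned in the last highlight, the hedged Python sketch below runs the same single-variable selection once through Impala (via the impyla DB-API client) and once through Spark SQL, both reading the same HDFS-resident table. Host name, port, table and column names are assumptions made for this example, not our test configuration.

```python
# Illustrative only: Impala and Spark querying the same HDFS-resident table.
# Host, port and the schema (logging_data: variable_id, ts, value, day) are
# placeholders assumed for this sketch.
from impala.dbapi import connect          # impyla client for Impala
from pyspark.sql import SparkSession

QUERY = """
    SELECT ts, value
    FROM logging_data
    WHERE variable_id = 4242
      AND day = '2015-06-01'
"""

# 1) Impala: MPP SQL query engine reading the HDFS files directly.
conn = connect(host="impala-host.example.org", port=21050)
cur = conn.cursor()
cur.execute(QUERY)
for ts, value in cur.fetchall():
    print(ts, value)

# 2) Spark SQL: the same table, resolved through the shared Hive metastore.
spark = (SparkSession.builder
         .appName("impala-vs-spark-sketch")
         .enableHiveSupport()
         .getOrCreate())
spark.sql(QUERY).show()
```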


Summary

Introduction

CERN, and high energy physics in general, have developed over the years many techniques to store and manage large amounts of data.

Using Hadoop for accelerating data analytics at CERN

Given the variety of available solutions, a series of tests and investigations has been performed to find the right set of technologies for processing the large time-series datasets of interest for CERN use cases.

Bucketing for efficient data access

Even though the Hadoop platform offers highly scalable throughput for data access and processing, there is a family of queries that executes relatively slowly compared to RDBMS systems: those that access only one or a few variables. When the daily data of the LHC logging service are grouped into 10 buckets (using mod(variable_id, 10)), a significant performance improvement is observed for single-variable selections: only 4 GB instead of 40 GB needs to be read from a daily partition, which reduces the query execution time by a factor of 10 (1 s instead of 10 s in our example).
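
A minimal PySpark sketch of this bucketing scheme is given below. The paths and the column names (variable_id, ts, value) are placeholders, and the real service may implement the buckets differently (for example as Hive/Impala table partitions); the point is only that a single-variable query then touches roughly one tenth of a day's data.

```python
# Minimal sketch of bucketing daily time-series data by mod(variable_id, 10).
# Paths and the schema (variable_id, ts, value) are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bucketing-sketch").getOrCreate()
NUM_BUCKETS = 10

# Load one raw daily partition (hypothetical path).
daily = spark.read.parquet("hdfs:///logging/raw/day=2015-06-01")

# Rewrite the day split into 10 sub-directories keyed by mod(variable_id, 10).
(daily
 .withColumn("bucket", F.col("variable_id") % NUM_BUCKETS)
 .write
 .partitionBy("bucket")
 .mode("overwrite")
 .parquet("hdfs:///logging/bucketed/day=2015-06-01"))

# Single-variable selection: read only the bucket holding the requested
# variable instead of the whole daily partition (~1/10 of the data).
variable_id = 4242
bucket = variable_id % NUM_BUCKETS
subset = (spark.read
          .parquet(f"hdfs:///logging/bucketed/day=2015-06-01/bucket={bucket}")
          .where(F.col("variable_id") == variable_id))
subset.select("ts", "value").show()
```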

Benefits from a columnar store

Conclusions
