OctopusDB: Flexible and Scalable Storage Management for Arbitrary Database Engines

Alekh Jindal

doi:10.22028/d291-26395

Abstract

We live in a dynamic age in which the economy, technology, and the people around us are changing faster than ever before. Consequently, the data management needs of the modern world are very different from those envisioned by the early database inventors in the 1970s. Today, enterprises face the challenge of managing ever-growing datasets under dynamically changing query workloads. As a result, modern data managing systems, both relational and big data management systems, can no longer afford to be carved-in-stone solutions. Instead, they must inherently provide flexible data management techniques in order to cope with constantly changing business needs. The current practice for dealing with changing query workloads is to use a different specialized product for each workload type, e.g. row stores for OLTP workloads, column stores for OLAP workloads, streaming systems for streaming workloads, and scan-oriented systems for shared query processing. However, this means that enterprises have to glue different data managing products together and copy data from one product to another in order to support several query workloads. It also carries the additional penalty of managing a zoo of data managing systems in the first place, which is tedious, expensive, and counter-productive for modern enterprises. This thesis presents an alternative approach to supporting several query workloads in a data managing system. We observe that each specialized database product has a different data store, indicating that different query workloads work well with different data layouts. A key requirement for supporting several query workloads is therefore to support several data layouts. In this thesis, we study ways to inject different data layouts into existing (and familiar) data managing systems. The goal is to develop a flexible storage layer that can support several query workloads in a single data managing system.
We present a set of non-invasive techniques, coined Trojan Techniques, to inject different data layouts into a data managing system. The core idea of Trojan Techniques is to drop the assumption of having one fixed data store per data managing system. Trojan Techniques are non-invasive in the sense that they do not make heavy, untenable changes to the system. Rather, they affect the data managing system from the inside, almost at the core. As a result, Trojan Techniques bring significant improvements in query performance. Our approach follows a design pattern that has also been used in other non-invasive research work, such as PAX, fractal prefetching B+-trees, and RowCol. We propose four Trojan Techniques. First, Trojan Indexes add an additional index access path to Hadoop MapReduce. Second, Trojan Joins allow for co-partitioned joins in Hadoop MapReduce. Third, Trojan Layouts allow for row, column, or column-grouped layouts in Hadoop MapReduce. Together, these three techniques provide a highly flexible data storage layer for Hadoop MapReduce. Our final proposal, Trojan Columns, introduces columnar functionality into row-oriented relational databases, including closed-source commercial databases, thus bridging the gap between row-oriented and column-oriented databases. Our experimental results show that Trojan Techniques can improve the performance of Hadoop MapReduce by a factor of up to 18, and that of a top-notch commercial database product by a factor of up to 17.
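
To make the notion of row, column, and column-grouped layouts concrete, the following is a minimal in-memory sketch in Python. It is not the Trojan Layouts implementation and does not use Hadoop; the schema, the function names, and the sample records are all hypothetical and chosen only for illustration.

```python
"""Toy illustration of three data layouts for the same records.

All names here (SCHEMA, row_layout, column_layout, column_group_layout)
are hypothetical and are not taken from the thesis.
"""
from typing import Dict, List, Sequence, Tuple

Record = Tuple[int, str, float]
SCHEMA = ("id", "name", "price")  # assumed example schema


def row_layout(records: Sequence[Record]) -> List[Record]:
    # Row layout: all attributes of one record are stored together.
    return list(records)


def column_layout(records: Sequence[Record]) -> Dict[str, list]:
    # Column layout: all values of one attribute are stored together.
    return {name: [rec[i] for rec in records] for i, name in enumerate(SCHEMA)}


def column_group_layout(
    records: Sequence[Record], groups: Sequence[Tuple[str, ...]]
) -> Dict[Tuple[str, ...], List[tuple]]:
    # Column-grouped layout: attributes in the same group are stored together.
    pos = {name: i for i, name in enumerate(SCHEMA)}
    return {
        tuple(group): [tuple(rec[pos[attr]] for attr in group) for rec in records]
        for group in groups
    }


records = [(1, "bolt", 0.25), (2, "nut", 0.10), (3, "gear", 4.50)]

rows = row_layout(records)
cols = column_layout(records)
grouped = column_group_layout(records, [("id", "name"), ("price",)])

# A point lookup (OLTP-style) reads one complete record from the row layout.
lookup = rows[1]

# An aggregate (OLAP-style) reads only the "price" values from the column layout.
total = sum(cols["price"])
```

In this sketch the lookup touches a single stored record, while the aggregate touches a single stored column. That difference in access pattern is the intuition behind matching layouts to workloads.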

