Abstract

Volumes of data used in science and industry are growing rapidly. When researchers face the challenge of analyzing them, the data format is often the first obstacle. The lack of standardized ways of exploring different data layouts forces the problem to be solved from scratch each time. The ability to access data in a rich, uniform manner, e.g. using Structured Query Language (SQL), would offer expressiveness and user-friendliness. Comma-separated values (CSV) is one of the most common data storage formats. Despite its simplicity, handling it becomes non-trivial as file size grows. Importing CSV files into existing databases is time-consuming and troublesome, or even impossible if the horizontal dimension reaches thousands of columns. Most databases are optimized for handling a large number of rows rather than columns, so performance on datasets with non-typical layouts is often unacceptable. Other challenges include schema creation, updates and repeated data imports. To address the above-mentioned problems, I present a system for accessing very large CSV-based datasets by means of SQL. It is characterized by: a “no copy” approach – the data stay mostly in the CSV files; “zero configuration” – no need to specify a database schema; implementation in C++ with boost [1], SQLite [2] and Qt [3], requiring no installation and having a very small footprint; query rewriting, dynamic creation of indices for appropriate columns, and static data retrieval directly from the CSV files, which together ensure efficient plan execution; effortless support for millions of columns; per-value typing, which makes working with mixed text/number data easy; and a very simple network protocol that provides an efficient interface for MATLAB and reduces implementation time for other languages. The software is available as freeware, along with educational videos, on its website [4]. It needs no prerequisites to run, as all of the libraries are included in the distribution package.
I test it against existing database solutions using a battery of benchmarks and discuss the results.
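To make the per-value typing mentioned in the abstract concrete, the following is a minimal, illustrative sketch (the actual mynodbcsv logic is not shown in this text): each CSV field is classified on its own as an integer, a real number, or text, so a column may freely mix numbers and text. The `CellType` enum and `classifyCell` function are hypothetical names introduced here for illustration.

```cpp
#include <cctype>
#include <string>

// Illustrative only: classify a single CSV field per value, not per column.
enum class CellType { Integer, Real, Text };

inline CellType classifyCell(const std::string& s) {
    if (s.empty()) return CellType::Text;
    std::size_t i = (s[0] == '+' || s[0] == '-') ? 1 : 0; // optional sign
    if (i == s.size()) return CellType::Text;             // sign alone is text
    bool digits = false, dot = false;
    for (; i < s.size(); ++i) {
        if (std::isdigit(static_cast<unsigned char>(s[i]))) digits = true;
        else if (s[i] == '.' && !dot) dot = true;         // at most one decimal point
        else return CellType::Text;                       // any other character: text
    }
    if (!digits) return CellType::Text;
    return dot ? CellType::Real : CellType::Integer;
}
```

Under this scheme a value such as "N/A" in an otherwise numeric column simply becomes a text cell, rather than forcing the whole column to a text type at import time.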

Highlights

  • When considering the processing of big and wide data, emphasis is often put on custom solutions that promise performance and customizability traditionally unavailable to normalized solutions such as Structured Query Language (SQL)-capable relational databases

  • It could be seen that the performance of PostgreSQL was severely reduced when serving a big TEXT-type column

  • In order to determine the root cause of this inefficiency, I ran PostgreSQL with its data folder placed on a RAM disk created using the ImDisk Virtual Disk Driver, to remove any unfair advantage of mynodbcsv

Introduction

When considering the processing of big and wide data, emphasis is often put on custom solutions (e.g. scripts in MATLAB/R/Python, programs in C/C++, various NoSQL [5] solutions) that promise performance and customizability traditionally unavailable to normalized solutions such as SQL-capable relational databases. It is worth recognizing the benefits of using a standardized language for querying the data. Those include: shorter development time, maintainability, an expressive and natural way of formulating queries, and ease of sharing them with collaborators, who need only understand SQL to know the purpose of a query. With scripting approaches to big data, even reading the data source is frequently not easy because of the inefficiency of high-level languages in running parsers. In light of these facts, it seems that the reasons stopping potential users from choosing a database approach to handling their data are: the inability of the latter to accommodate modern dataset sizes (big data) and layouts (wide data), the necessity to install appropriate software and move data into the system, and the need to design an appropriate database schema beforehand. As solutions come into existence that satisfy the need for efficient ad hoc access to computationally demanding datasets using standard languages like SQL (NoDB [6], mynodbcsv), this situation becomes likely to change.
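As a small illustration of why "even reading the data source is frequently not easy", consider just splitting one CSV record into fields. The helper below is a hypothetical sketch, not code from the system: it handles quoted fields, embedded commas, and escaped quotes in the style of RFC 4180, and even this minimal case shows how quickly hand-rolled readers grow complex compared to issuing a SQL query.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split one CSV record into fields, honoring
// double-quoted fields, embedded commas, and "" as an escaped quote.
inline std::vector<std::string> splitCsvLine(const std::string& line) {
    std::vector<std::string> fields;
    std::string cur;
    bool inQuotes = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        char c = line[i];
        if (inQuotes) {
            if (c == '"') {
                if (i + 1 < line.size() && line[i + 1] == '"') {
                    cur += '"';   // "" inside quotes is a literal quote
                    ++i;
                } else {
                    inQuotes = false;
                }
            } else {
                cur += c;
            }
        } else if (c == '"') {
            inQuotes = true;
        } else if (c == ',') {
            fields.push_back(cur); // field boundary outside quotes
            cur.clear();
        } else {
            cur += c;
        }
    }
    fields.push_back(cur);
    return fields;
}
```

A record such as `a,"b,c",d` yields three fields, not four; multi-line quoted fields, encodings, and type handling add further cases that a SQL front end hides from the user entirely.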

