Abstract

Volumes of data used in science and industry are growing rapidly. When researchers face the challenge of analyzing them, the data format is often the first obstacle. The lack of standardized ways of exploring different data layouts forces the problem to be solved from scratch each time. The ability to access data in a rich, uniform manner, e.g. using Structured Query Language (SQL), would offer expressiveness and user-friendliness. Comma-separated values (CSV) is one of the most common data storage formats. Despite its simplicity, handling it becomes non-trivial as file size grows. Importing CSV files into existing databases is time-consuming and troublesome, or even impossible if the horizontal dimension reaches thousands of columns. Most databases are optimized for handling a large number of rows rather than columns, so performance on datasets with non-typical layouts is often unacceptable. Other challenges include schema creation, updates and repeated data imports. To address the above-mentioned problems, I present a system for accessing very large CSV-based datasets by means of SQL. It is characterized by: a “no copy” approach – the data stay mostly in the CSV files; “zero configuration” – no need to specify a database schema; implementation in C++ with boost [1], SQLite [2] and Qt [3], requiring no installation and having a very small footprint; query rewriting, dynamic creation of indices for appropriate columns, and static data retrieval directly from the CSV files, which together ensure efficient plan execution; effortless support for millions of columns; per-value typing, which makes working with mixed text/number data easy; and a very simple network protocol that provides an efficient interface for MATLAB and reduces implementation time for other languages. The software is available as freeware, along with educational videos, on its website [4]. It needs no prerequisites to run, as all of the libraries are included in the distribution package.
I test it against existing database solutions using a battery of benchmarks and discuss the results.
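To make the per-value typing mentioned in the abstract concrete, the following is a minimal, illustrative sketch (the actual mynodbcsv logic is not shown in this text): each CSV field is classified on its own as an integer, a real number, or text, so a column may freely mix numbers and text. The `CellType` enum and `classifyCell` function are hypothetical names introduced here for illustration.

```cpp
#include <cctype>
#include <string>

// Illustrative only: classify a single CSV field per value, not per column.
enum class CellType { Integer, Real, Text };

inline CellType classifyCell(const std::string& s) {
    if (s.empty()) return CellType::Text;
    std::size_t i = (s[0] == '+' || s[0] == '-') ? 1 : 0; // optional sign
    if (i == s.size()) return CellType::Text;             // sign alone is text
    bool digits = false, dot = false;
    for (; i < s.size(); ++i) {
        if (std::isdigit(static_cast<unsigned char>(s[i]))) digits = true;
        else if (s[i] == '.' && !dot) dot = true;         // at most one decimal point
        else return CellType::Text;                       // any other character: text
    }
    if (!digits) return CellType::Text;
    return dot ? CellType::Real : CellType::Integer;
}
```

Under this scheme a value such as "N/A" in an otherwise numeric column simply becomes a text cell, rather than forcing the whole column to a text type at import time.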

Highlights

  • When considering the processing of big and wide data, emphasis is often put on custom solutions that promise performance and customizability traditionally unavailable to normalized solutions such as Structured Query Language (SQL)-capable relational databases

  • It could be seen that the performance of PostgreSQL was severely reduced when serving a big TEXT-type column

  • In order to determine the root cause of this inefficiency, I ran PostgreSQL with its data folder placed on a RAM disk created using the ImDisk Virtual Disk Driver, to remove any unfair advantage of mynodbcsv

Introduction

When considering the processing of big and wide data, emphasis is often put on custom solutions (e.g. scripts in MATLAB/R/Python, programs in C/C++, various NoSQL [5] solutions) that promise performance and customizability traditionally unavailable to normalized solutions such as SQL-capable relational databases. It is worth recognizing the benefits of using a standardized language for querying the data. Those include: shorter development time, maintainability, an expressive and natural way of formulating queries, and ease of sharing them with collaborators, who need only understand SQL to know the purpose of a query. With scripting approaches to big data, even reading the data source is frequently not easy because of the inefficiency of high-level languages in running parsers. In light of these facts, it seems that the reasons stopping potential users from choosing a database approach to handling their data are: the inability of the latter to accommodate modern dataset sizes (big data) and layouts (wide data), the necessity to install appropriate software and move data into the system, and the need to design an appropriate database schema beforehand. As solutions come into existence that satisfy the need for efficient ad hoc access to computationally demanding datasets using standard languages like SQL (NoDB [6], mynodbcsv), this situation becomes likely to change.
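As a small illustration of why "even reading the data source is frequently not easy", consider just splitting one CSV record into fields. The helper below is a hypothetical sketch, not code from the system: it handles quoted fields, embedded commas, and escaped quotes in the style of RFC 4180, and even this minimal case shows how quickly hand-rolled readers grow complex compared to issuing a SQL query.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split one CSV record into fields, honoring
// double-quoted fields, embedded commas, and "" as an escaped quote.
inline std::vector<std::string> splitCsvLine(const std::string& line) {
    std::vector<std::string> fields;
    std::string cur;
    bool inQuotes = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        char c = line[i];
        if (inQuotes) {
            if (c == '"') {
                if (i + 1 < line.size() && line[i + 1] == '"') {
                    cur += '"';   // "" inside quotes is a literal quote
                    ++i;
                } else {
                    inQuotes = false;
                }
            } else {
                cur += c;
            }
        } else if (c == '"') {
            inQuotes = true;
        } else if (c == ',') {
            fields.push_back(cur); // field boundary outside quotes
            cur.clear();
        } else {
            cur += c;
        }
    }
    fields.push_back(cur);
    return fields;
}
```

A record such as `a,"b,c",d` yields three fields, not four; multi-line quoted fields, encodings, and type handling add further cases that a SQL front end hides from the user entirely.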

