Processing genome scale tabular data with wormtable

Jerome Kelleher, Rob W Ness, Daniel L Halligan

doi:10.1186/1471-2105-14-356

Abstract

Background: Modern biological science generates a vast amount of data, the analysis of which presents a major challenge to researchers. Data are commonly represented in tables stored as plain text files and require line-by-line parsing for analysis, which is time consuming and error prone. Furthermore, there is no simple means of indexing these files so that rows containing particular values can be quickly found.

Results: We introduce a new data format and software library called wormtable, which provides efficient access to tabular data in Python. Wormtable stores data in a compact binary format, provides random access to rows, and enables sophisticated indexing on columns within these tables. Files written in existing formats can be easily converted to wormtable format, and we provide conversion utilities for the VCF and GTF formats.

Conclusions: Wormtable’s simple API allows users to process large tables orders of magnitude more quickly than is possible when parsing text. Furthermore, the indexing facilities provide efficient access to subsets of the data along with providing useful methods of summarising columns. Since third-party libraries or custom code are no longer needed to parse complex plain text formats, analysis code can also be substantially simpler as well as being uniform across different data formats. These benefits of reduced code complexity and greatly increased performance allow users much greater freedom to explore their data.

Highlights

Modern biological science generates a vast amount of data, the analysis of which presents a major challenge to researchers
Despite the ever increasing volumes of data being processed in bioinformatics, the methods used are almost entirely based on plain text files
The problems of enabling efficient random access to rows and avoiding the large overhead of parsing text are well understood, and efforts to address them are proceeding in parallel for different file formats

Summary

Introduction

Modern biological science generates a vast amount of data, the analysis of which presents a major challenge to researchers. Text files can be quite compact, and specialised indexing methods are available to retrieve specific rows, for example rows which intersect with a given genomic interval [1]. It is not sufficient, however, simply to store and retrieve data. The major flaw in using text files as a data format is that, before we can perform calculations, we must first parse the encoded information into native machine values. This is a computationally expensive process, and compression (if it is used) adds substantial overhead. Simple calculations over a large dataset may take many hours to complete.
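The parsing cost described above can be illustrated with a minimal sketch using only the Python standard library. It does not use wormtable and does not reproduce its file format: the three-column table, the fixed-width record layout, and the helper names `parse_text` and `read_row` are all assumptions made for illustration. The sketch contrasts a text table, where every value must be converted from characters on each read, with a binary table of fixed-width records, where a row can be located by its byte offset and decoded directly.

```python
import struct

# Hypothetical three-column table: chromosome id, position, quality score.
rows = [(1, 10583, 29.5), (1, 10611, 48.0), (2, 13302, 12.25)]

# Text representation: tab-separated values, one row per line.
text = "".join(f"{c}\t{p}\t{q}\n" for c, p, q in rows)


def parse_text(data):
    """Convert every field of every line from characters to native values."""
    parsed = []
    for line in data.splitlines():
        chrom, pos, qual = line.split("\t")
        parsed.append((int(chrom), int(pos), float(qual)))
    return parsed


# Binary representation: fixed-width little-endian records.
# This layout is an assumption for illustration, not wormtable's format.
RECORD = struct.Struct("<BId")  # 1-byte uint, 4-byte uint, 8-byte float
binary = b"".join(RECORD.pack(*row) for row in rows)


def read_row(data, index):
    """Random access: decode row `index` directly from its byte offset."""
    return RECORD.unpack_from(data, index * RECORD.size)


text_rows = parse_text(text)    # must scan and convert the whole table
third_row = read_row(binary, 2) # touches only the bytes of one record
```

In the text case, reaching the third row requires scanning and splitting every preceding line, whereas the binary case computes the offset `index * RECORD.size` and decodes a single record. Real genomic tables have variable-length columns, so an actual format needs more than fixed-width records; the sketch only shows why avoiding text parsing and enabling offset-based access are attractive.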

Objectives

Results

Discussion

Conclusion