High dimensional biological data retrieval optimization with NoSQL technology.

Shicai Wang,Chao Wu,Florian Guitton,Ioannis Pandis,Sijin He,David Johnson,Yike Guo,Ibrahim Emam

doi:10.1186/1471-2164-15-s8-s3

Abstract

BackgroundHigh-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, when querying relational databases for hundreds of different patient gene expression records queries are slow due to poor performance. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data.ResultsIn this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase on query performance on MongoDB.ConclusionsThe performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.

Highlights

High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies
We compared the query retrieval time against a traditional relational data model running on a MySQL Cluster database, and a relational model running on a MongoDB key-value document database
Our results show that in general our key-value implementation of the tranSMART DEAPP schema using HBase outperforms the relational model on both MySQL Cluster and MongoDB, as we originally hypothesized

Summary

Introduction

High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. For optimal analyzes and meaningful interpretations, such databases store legacy data taken from public sources, such as the Gene Expression Omnibus (GEO) [1] and the Gene Expression Atlas [2], alongside new study data. For the needs of various collaborative translational research projects, an instance of tranSMART is hosted at Imperial College London and has been configured to use an Oracle relational database for back-end storage It currently holds over 70 million gene expression records. When querying the database simultaneously for hundreds of patient gene expression records, a typical exercise in translational studies, the record retrieval time can currently take up to several minutes These kinds of response times impede analyzes performed by researchers using this deployed configuration of tranSMART. Anticipating the requirement to store and analyze generation sequencing data, where the volume of data being produced will be in the terabyte (TB) range, the current performance exhibited by tranSMART is unacceptably poor

Objectives

Results

Conclusion