Exploiting Genomic Relations in Big Data Repositories by Graph-Based Search Methods

Aliyu Musa, Olli Yli-Harja, Frank Emmert-Streib, Matthias Dehmer

doi:10.3390/make1010012

Abstract

We are living at a time that allows the generation of mass data in almost any field of science. For instance, in pharmacogenomics, there exist a number of big data repositories, e.g., the Library of Integrated Network-based Cellular Signatures (LINCS), that provide millions of measurements at the genomics level. However, to translate these data into meaningful information, the data need to be analyzable. The first step for such an analysis is the deliberate selection of subsets of raw data for studying dedicated research questions. Unfortunately, this is a non-trivial problem when millions of individual data files are available with an intricate connection structure induced by experimental dependencies. In this paper, we argue for the need to introduce search capabilities that support such a selection for big genomics data repositories, with a specific discussion of LINCS. Specifically, we suggest the introduction of smart interfaces that exploit, by graph-based searches, the connections among individual raw data files, which give rise to a network structure.

Highlights

In the last 20 years, technological progress in high-throughput assays, e.g., next-generation sequencing, has led to a tremendous increase in our data generation capabilities in genomics.
The data repositories we are concerned with in our paper store raw data files. We discuss the problem of accessing selected subsets of these data by focusing on the pharmacogenomic data repository LINCS (Library of Integrated Network-based Cellular Signatures) [3,4,5,6,7,8] and describe how this lack of querying capability could be compensated for.
We use the term database to refer to an organized collection and storage of data for which a database management system (DBMS) is available that allows querying the data from the database.

Summary

Introduction

In the last 20 years, technological progress in high-throughput assays, e.g., next-generation sequencing, has led to a tremendous increase in our data generation capabilities in genomics. The problem is that accessing selected subsets of these “big data” for performing a dedicated analysis is non-trivial due to the sheer volume of data and, more importantly, the complexity of the connections between different data points. Most data collections do not provide efficient interfaces enabling direct access to subsets of raw data, which hampers downstream analysis. We would like to highlight that, here, we are concerned with accessing and selecting raw data, not knowledge that has been derived by processing and analyzing raw data and stored in knowledge databases. The data repositories we are concerned with in our paper store raw data files (see Figure 1 for a brief overview). We discuss this problem by focusing on the pharmacogenomic data repository LINCS (Library of Integrated Network-based Cellular Signatures) [3,4,5,6,7,8] and describe how this lack of querying capability could be compensated for.
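To make the notion of connections between raw data files more concrete, the following minimal Python sketch treats each raw data file as a node and links two files whenever they share an experimental attribute. A breadth-first search then selects all files within a given number of hops of a starting file. The file names, the metadata fields (`cell_line`, `compound`, `plate`), their values, and the rule for creating edges are all hypothetical assumptions made for illustration; they do not reflect the actual LINCS metadata schema or any interface proposed in this paper.

```python
from collections import defaultdict, deque

# Hypothetical metadata for four raw data files (illustrative values only).
FILES = {
    "file_A": {"cell_line": "line_1", "compound": "drug_1", "plate": "plate_1"},
    "file_B": {"cell_line": "line_1", "compound": "drug_2", "plate": "plate_1"},
    "file_C": {"cell_line": "line_2", "compound": "drug_2", "plate": "plate_2"},
    "file_D": {"cell_line": "line_3", "compound": "drug_3", "plate": "plate_3"},
}


def build_graph(files):
    """Link two files whenever they share at least one (field, value) pair."""
    index = defaultdict(set)
    for name, meta in files.items():
        for field, value in meta.items():
            index[(field, value)].add(name)
    graph = {name: set() for name in files}
    for members in index.values():
        for name in members:
            graph[name] |= members - {name}
    return graph


def reachable(graph, start, max_hops):
    """Breadth-first search returning {file: hop distance} up to max_hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


graph = build_graph(FILES)
selected = reachable(graph, "file_A", max_hops=2)
print(sorted(selected.items()))
```

In this toy example, a query starting at `file_A` reaches `file_B` directly (shared cell line and plate) and `file_C` indirectly through the compound that `file_B` and `file_C` share, while `file_D` is never selected because it has no attribute in common with the others. A real repository would need a considered choice of which experimental dependencies define an edge.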

Preliminaries

The Pharmacogenomics Data Repository LINCS

Conceptual Idea

Further Applications

Conclusions