Sherlock: an open-source data platform to store, analyze and integrate Big Data for biology.

Luca Csabai, Balazs Bohar, Matthew Madgwick, Mate Szalay-Beko, Tamás Korcsmáros, David Fazekas, Marton Olbei

doi:10.12688/f1000research.52791.1


Open Access

https://doi.org/10.12688/f1000research.52791.1


Abstract

In the era of Big Data, data collection underpins biological research more than ever before. In many cases it can be as time-consuming as the analysis itself: researchers have to download multiple public databases with different data structures, and often spend days on this before they can address any biological question. To solve this problem, we introduce an open-source, cloud-based big data platform called Sherlock (https://earlham-sherlock.github.io/). Sherlock fills this gap by giving biologists a way to store, convert, query, share and generate biological data, while ultimately streamlining bioinformatics data management. The Sherlock platform provides a simple interface to leverage big data technologies, such as Docker and PrestoDB. Sherlock is designed to analyse, process, query and extract information from extremely complex and large data sets. Furthermore, Sherlock can handle differently structured data (interaction, localization, or genomic sequence) from several sources and convert them to a common, optimized storage format, for example Optimized Row Columnar (ORC). This format allows Sherlock to quickly and easily execute distributed analytical queries on extremely large data files, as well as to share datasets between teams. The Sherlock platform is freely available on GitHub, and contains specific loader scripts for structured data sources of genomics, interaction and expression databases. With these loader scripts, users can easily and quickly create and work with specific file formats, such as JavaScript Object Notation (JSON) or ORC. For computational biology and large-scale bioinformatics projects, Sherlock provides an open-source platform empowering data management, data analytics, data integration and collaboration through modern big data technologies.

Highlights

Most bioinformatics projects start with gathering a lot of data
Sherlock provides simple scripts to save this metadata in the Data Lake when you want to make a backup or before you terminate your analytical cluster
Sherlock offers the following features: 1) store all datasets in redundant and organized cloud storage, 2) convert all datasets to common, optimized file formats, 3) execute analytical queries on top of data files, 4) share datasets among different teams/projects, 5) generate operational datasets for certain services or collaborators. It is useful for any group or team in the field of computational biology that has to work with very large datasets for its projects

Summary

Introduction

Most bioinformatics projects start with gathering a lot of data. Often bioinformaticians have to work on bespoke datasets, for example, gene expression or mutation data, but in almost all cases this requires some sort of external reference data. It is important to mention that Sherlock has been designed for biologists, especially for network and systems biologists. It can contain specific interaction, expression and genome-related databases, thanks to loader scripts that create the specific file formats from the source databases and upload them into the Data Lake.

Use Case 2: Tissue specificity

In this example, we would like to query the top 100 most highly expressed genes in a given tissue (in this case the human colon), and their protein interactors. The limitation with this use case is similar to the previous one: the user has to download the differently structured interaction databases from web resources and write scripts or use online tools to work with all of the data at once.

In the other case, when we want to enrich the network, it is enough to select all of the interactions where either the source or the target protein is among our proteins of interest.
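The selection logic behind the tissue-specificity and network-enrichment examples can be illustrated with a minimal Python sketch. This is not Sherlock's own interface: in Sherlock such selections are run as analytical queries over the files in the Data Lake. The `Interaction` record, the function names, the gene identifiers and the expression values below are all hypothetical and serve only to show the two steps, namely picking the top N expressed genes and keeping the interactions where either endpoint is among them.

```python
from typing import Dict, Iterable, List, NamedTuple, Set


class Interaction(NamedTuple):
    """Hypothetical record for one directed interaction."""
    source: str
    target: str


def top_expressed(expression: Dict[str, float], n: int = 100) -> List[str]:
    """Return the n gene identifiers with the highest expression values."""
    return sorted(expression, key=expression.get, reverse=True)[:n]


def enrich(interactions: Iterable[Interaction], proteins: Set[str]) -> List[Interaction]:
    """Keep interactions whose source OR target is among the given proteins."""
    return [i for i in interactions if i.source in proteins or i.target in proteins]


# Toy data (made up for illustration; not taken from any real database).
colon_expression = {"GENE_A": 120.5, "GENE_B": 3.2, "GENE_C": 87.0, "GENE_D": 0.4}
interactions = [
    Interaction("GENE_A", "GENE_D"),
    Interaction("GENE_B", "GENE_D"),
    Interaction("GENE_C", "GENE_A"),
]

# Use n=2 on the toy data; the use case in the text uses the top 100.
top_genes = set(top_expressed(colon_expression, n=2))
first_neighbours = enrich(interactions, top_genes)
```

Requiring both endpoints to be in the set (replacing `or` with `and`) would instead return only the interactions among the selected genes themselves, without their first neighbours.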

Discussion

Conclusion