PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets

Luca Nanni, Stefano Ceri, Pietro Pinoli, Arif Canakoglu

doi:10.1186/s12859-019-3159-9

Abstract

Background: With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists are challenged in performing efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing experimental files from such datasets one by one, but do not scale to thousands of experiments. Moreover, they lack proper support for metadata manipulation.

Results: We present PyGMQL, a novel software for the manipulation of region-based genomic files and their relative metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for the manipulation of region data and their metadata that can scale to arbitrary clusters and implicitly apply to thousands of files, producing millions of regions. PyGMQL provides data interoperability, distribution transparency and query outsourcing. The PyGMQL package integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. It supports data interoperability, solving the impedance mismatch between executing set-oriented queries and programming in Python. PyGMQL provides distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) in an orthogonal way. Outsourced processing can address cloud-based installations of the GMQL engine.

Conclusions: PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the expressiveness and performance of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight reproducibility, expressive power and scalability.

Highlights

With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions
We proposed the Genomic Data Model (GDM) [2] and the GenoMetric Query Language (GMQL) [3, 4], composed of a query language and an engine built on top of Apache Spark [5]
We demonstrate the flexibility of the PyGMQL library through three data analysis workflows, available in the form of Jupyter Notebooks and scripts both in the Supplementary Materials of this paper and in the PyGMQL GitHub repository

Summary

Results

We demonstrate the flexibility of the PyGMQL library through three data analysis workflows, available in the form of Jupyter Notebooks and scripts both in the Supplementary Materials of this paper and in the PyGMQL GitHub repository. The result is visualized as a heatmap, with rows representing promoters and columns representing ChIP-Seq experiments. This example shows: (i) the integration of local PyGMQL programs with remote repositories, (ii) the possibility to outsource the execution to an external deployment of (Py)GMQL, (iii) the interplay between PyGMQL data and Python libraries written by third parties. The code reported in the Supplementary Material illustrates the data extraction part. It uses a normal_cover to merge replicates of the same experiment, followed by two join operations: the former detects the overlap between each TF region and active promoter regions, the latter extracts the pairs of regions of two TFs at minimal distance within such regions. By using the cluster with 10 slaves, we build output data of 381M regions in about half
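The cover-then-join extraction described above can be illustrated with a small, dependency-free Python sketch. It is not the PyGMQL API and not the code from the Supplementary Material: the names `Region`, `cover`, `overlap_join` and `min_distance_join`, the half-open coordinate convention, and the toy coordinates are all assumptions made for illustration. As a simplification, the sketch also does not require the paired regions to fall in the same promoter.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Region:
    """Hypothetical genomic region with half-open coordinates [start, stop)."""
    chrom: str
    start: int
    stop: int


def cover(replicates: List[List[Region]], min_acc: int = 1) -> List[Region]:
    """Merge replicates into maximal intervals covered by at least min_acc regions."""
    events = defaultdict(list)
    for sample in replicates:
        for r in sample:
            events[r.chrom].append((r.start, 1))
            events[r.chrom].append((r.stop, -1))
    merged = []
    for chrom in sorted(events):
        depth, open_start = 0, None
        # Tuples sort so that closings (-1) precede openings (+1) at the same
        # coordinate, hence abutting half-open intervals are not merged.
        for pos, delta in sorted(events[chrom]):
            depth += delta
            if depth >= min_acc and open_start is None:
                open_start = pos
            elif depth < min_acc and open_start is not None:
                merged.append(Region(chrom, open_start, pos))
                open_start = None
    return merged


def overlap_join(left: List[Region], right: List[Region]) -> List[Tuple[Region, Region]]:
    """Return every (left, right) pair sharing at least one base."""
    return [(a, b) for a in left for b in right
            if a.chrom == b.chrom and a.start < b.stop and b.start < a.stop]


def distance(a: Region, b: Region) -> int:
    """Gap in bases between two regions on the same chromosome (0 if they overlap or touch)."""
    return max(0, max(a.start, b.start) - min(a.stop, b.stop))


def min_distance_join(left: List[Region], right: List[Region]) -> List[Tuple[Region, Region]]:
    """Pair each left region with its closest right region on the same chromosome."""
    pairs = []
    for a in left:
        candidates = [b for b in right if b.chrom == a.chrom]
        if candidates:
            pairs.append((a, min(candidates, key=lambda b: distance(a, b))))
    return pairs


# Toy data (made up): two replicates of one TF, a second TF, and two promoters.
promoters = [Region("chr1", 100, 200), Region("chr1", 1000, 1100)]
tf_a_rep1 = [Region("chr1", 120, 140)]
tf_a_rep2 = [Region("chr1", 130, 150), Region("chr1", 5000, 5050)]
tf_b = [Region("chr1", 160, 180), Region("chr1", 1010, 1020)]

# Step 1: merge replicates of the same experiment.
tf_a = cover([tf_a_rep1, tf_a_rep2])

# Step 2: first join, keeping TF regions that overlap a promoter (deduplicated).
tf_a_in_prom = list(dict.fromkeys(tf for tf, _ in overlap_join(tf_a, promoters)))
tf_b_in_prom = list(dict.fromkeys(tf for tf, _ in overlap_join(tf_b, promoters)))

# Step 3: second join, pairing regions of the two TFs at minimal distance.
pairs = min_distance_join(tf_a_in_prom, tf_b_in_prom)
print(pairs)
```

The nested loops make both joins quadratic in the number of regions, which is fine for a toy example but is exactly the cost that a distributed engine is meant to avoid at the scale of hundreds of millions of regions.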



Journal: BMC Bioinformatics	Publication Date: Nov 8, 2019
Citations: 16	License type: open-access
