Machado: Open source genomics data integration framework.

Mauricio De Alvarenga Mudadu, Adhemar Zerlotini

doi:10.1093/gigascience/giaa097

Open Access

Journal: GigaScience	Publication Date: Sep 14, 2020
Citations: 4	License type: CC BY 4.0

Affiliation: Brazilian Agricultural Research Corporation

Abstract

Background: Genome projects and multiomics experiments generate huge volumes of data that must be stored, mined, and transformed into useful knowledge. All this information is supposed to be accessible and, if possible, browsable afterwards. Computational biologists have been dealing with this scenario for more than a decade and have been implementing software and databases to meet this challenge. Chado, the biological relational database schema of GMOD (Generic Model Organism Database), is one of the few successful open source initiatives; it is widely adopted, and many software packages are able to connect to it.

Findings: We have been developing an open source software package named Machado, a genomics data integration framework implemented in Python, to enable research groups to both store and visualize genomics data. The framework relies on the Chado database schema and should therefore be intuitive for current developers to adopt or to run on top of existing databases. It has several data-loading tools for genomics and transcriptomics data and for annotation results from tools such as BLAST, InterProScan, OrthoMCL, and LSTrAP. There is an API to connect to JBrowse, and a web visualization tool is implemented using Django Views and Templates. The Haystack library, integrated with the Elasticsearch engine, was used to implement a Google-like search, i.e., a single auto-complete search box that provides fast results and filters.

Conclusion: Machado aims to be a modern object-relational framework that uses the latest Python libraries to produce an effective open source resource for genomics research.

Highlights

Genome projects and multiomics experiments generate huge volumes of data that must be stored, mined, and transformed into useful knowledge
We have been developing an open source software package named Machado, a genomics data integration framework implemented in Python, to enable research groups to store, browse, query, and visualize genomics data
The framework relies on the Chado database schema and should therefore be intuitive for current developers to adopt or to run on top of existing databases

Summary

Introduction

Genome projects and multiomics experiments generate huge volumes of data that must be stored, mined, and transformed into useful knowledge. Omics data integration offers the potential to increase productivity and sustainability in crop and livestock production. The challenges are diverse but usually include identifying genetic variation that gives rise to desirable traits and can drive genomic prediction; performing precise genome editing/engineering (e.g., using CRISPR-Cas systems to induce mutations or disruptions in the genome); and identifying molecular targets for developing vaccines against diseases and plagues, among others (Huang et al., 2017). All this novel genomic information, especially that from genome projects and multiomics experiments (transcriptomics, proteomics, etc.), is supposed to be accessible and, if possible, browsable afterwards. Bioinformaticians and computational biologists have been dealing with this scenario for over a decade and have implemented (and are still implementing) a collection of software libraries, toolkits, platforms, databases, and data warehouses to address it.

Methods

Results

Conclusion