OpenCGA: a scalable and high-performance platform for big data analysis and visualisation in genomics

Jacobo Coll, Joaquín Tárraga, Pablo Marín-García, Marta Bleda, Julie Sullivan, David Gómez-Peregrina, Antonio Rueda, P Furió-Tarí, Antonio Altamura, José M Juanes, D Perez-Gil, Ignacio Medina-Castelló

doi:10.6084/m9.figshare.12895910.v1

Abstract

Background: Current large-scale clinical genomics studies, consisting of tens of thousands of whole genome sequences, require a platform that can analyse billions of unique variants over hundreds of terabytes of data. Description: OpenCGA is an open-source project that implements a high-performance, scalable and secure platform for genomic data analysis and visualisation. It relies on current big data technologies such as Hadoop, Spark, MongoDB or Solr to implement an advanced analytical variant storage engine that can index and aggregate thousands of whole genomes a day and supports real-time queries, aggregations, quality control and genomic analysis. It can also store variant annotation and precomputed cohort statistics. An analytical component is also implemented to query and execute different built-in analyses, such as clinical interpretation or GWAS, using federation. OpenCGA implements a Catalog database to keep track of users, metadata, permissions, clinical data, quality control, etc. Conclusion: OpenCGA is used as a data platform at Genomics England (GEL) and other large genomics institutions, and it is available on Microsoft Azure. OpenCGA has been shown to scale and perform well with nearly 100,000 whole genomes, accounting for 584 million aggregated variants or 40 TB of data. In addition, it implements a rich RESTful web service API, a command line interface and multiple client libraries. An Interactive Variant Analysis (IVA) browser is provided to analyse and visualise biological information from various data sources. OpenCGA and IVA are open-source and part of the OpenCB suite.

Full Text