Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics

Shakuntala Baichoo, Brian D O’Connor, Azza E Ahmed, Ayton Meintjes, Gerrit Botha, Long Yi, Scott Hazelhurst, Michael R Crusoe, Oussema Souiai, Mamana Mbiyavanga, Lerato E Magosi, C V Jongeneel, Shaun Aron, Eugene De Beste, Fourie Joubert, Don Armstrong, Hocine Bendou, Faisal M Fadlelmola, Phelelani T Mpangase, Liudmila Sergeevna Mainzer, Jennie Zermeno, Peter Van Heusden, Sumir Panji, Mustafa Alghali, Yassine Souilmi, Nicola Mulder

doi:10.1186/s12859-018-2446-1

Abstract

Background: The Pan-African bioinformatics network, H3ABioNet, comprises 27 research institutions in 17 African countries. H3ABioNet is part of the Human Heredity and Health in Africa program (H3Africa), an African-led research consortium funded by the US National Institutes of Health and the UK Wellcome Trust, aimed at using genomics to study and improve the health of Africans. A key role of H3ABioNet is to support H3Africa projects by building bioinformatics infrastructure such as portable and reproducible bioinformatics workflows for use on heterogeneous African computing environments. Processing and analysis of genomic data is an example of a big data application requiring complex, interdependent data analysis workflows. Such bioinformatics workflows take the primary and secondary input data through several computationally intensive processing steps using different software packages, where some of the outputs form inputs for other steps. Implementing scalable, reproducible, portable and easy-to-use workflows is particularly challenging.

Results: H3ABioNet has built four workflows to support (1) the calling of variants from high-throughput sequencing data; (2) the analysis of microbial populations from 16S rDNA sequence data; (3) genotyping and genome-wide association studies; and (4) single nucleotide polymorphism imputation. A week-long hackathon was organized in August 2016 with participants from six African bioinformatics groups, and US and European collaborators. Two of the workflows are built using the Common Workflow Language (CWL) framework and two using Nextflow. All the workflows are containerized using Docker for improved portability and reproducibility, and are publicly available for use by members of the H3Africa consortium and the international research community.

Conclusion: The H3ABioNet workflows have been implemented with a view to offering ease of use for the end user and high levels of reproducibility and portability, while following modern state-of-the-art bioinformatics data processing protocols. The H3ABioNet workflows will service the H3Africa consortium projects and are currently in use. All four workflows are also publicly available for research scientists worldwide to use and adapt for their respective needs. The H3ABioNet workflows will help develop bioinformatics capacity, assist genomics research within Africa, and increase the scientific output of H3Africa and its Pan-African Bioinformatics Network.

Highlights

The Pan-African bioinformatics network, H3ABioNet, comprises 27 research institutions in 17 African countries
We identified the most important analyses as (1) variant calling on next generation sequence (NGS) data; (2) 16S rRNA gene (16S rDNA) sequence analysis for metagenomics; (3) genome-wide association studies; and (4) imputation
The variant calling workflow requires Docker to be installed, and takes the Genome Analysis ToolKit (GATK) jar file and sequence reads in FASTQ format as input files

Summary

Introduction

The Pan-African bioinformatics network, H3ABioNet, comprises 27 research institutions in 17 African countries. Processing and analysis of genomic data is an example of a big data application requiring complex, interdependent data analysis workflows. Tasks that were previously slow and confined to research have become routine in day-to-day bioinformatics and medical genomics. These advances have resulted in a biomedical data deluge, with sequencing centres routinely generating data at the petabyte scale, leaving researchers and clinicians with a data processing and analysis bottleneck. Modern workflow management systems offer a high level of reproducibility, portability and computing platform independence, enabling researchers to focus more on developing new methods and interpreting results.
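The idea of interdependent steps, where some outputs form the inputs of later steps, can be illustrated with a minimal Python sketch. It is not how CWL or Nextflow are implemented, and it is not taken from the H3ABioNet workflows: the step names, the `DEPENDENCIES` table, and the `run_step` placeholder are all hypothetical, chosen only to show how a workflow engine might order steps from their declared dependencies.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical step names (assumptions for illustration only).
# Each step maps to the set of steps whose outputs it consumes.
DEPENDENCIES = {
    "quality_control": set(),
    "align_reads": {"quality_control"},
    "sort_alignments": {"align_reads"},
    "call_variants": {"sort_alignments"},
    "summary_report": {"quality_control", "call_variants"},
}


def run_step(name, inputs):
    """Placeholder for a real tool invocation: records what the step consumed."""
    return f"{name}({', '.join(sorted(inputs))})"


def run_workflow(dependencies):
    """Run each step only after the steps it depends on; return order and outputs."""
    outputs = {}
    # static_order() yields every step after all of its predecessors
    # and raises graphlib.CycleError if the dependencies are circular.
    order = list(TopologicalSorter(dependencies).static_order())
    for step in order:
        upstream = [outputs[dep] for dep in dependencies[step]]
        outputs[step] = run_step(step, upstream)
    return order, outputs


if __name__ == "__main__":
    order, outputs = run_workflow(DEPENDENCIES)
    for step in order:
        print(step, "->", outputs[step])
```

In a real workflow system, `run_step` would launch a containerized tool and pass files rather than strings; the sketch only shows the dependency ordering.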

Journal: BMC Bioinformatics
Publication Date: Nov 29, 2018
Citations: 32
License type: open-access