Abstract

Next-generation sequencing (NGS) technologies produce enormous volumes of biological data, which raises challenges such as long processing times and large memory requirements. This research focuses on the detection of single nucleotide polymorphisms (SNPs) in genome sequences. Current SNP detection algorithms face several issues, including computational overhead, limited accuracy, and high memory demands. In this research, we propose a fast and scalable workflow that integrates the Bowtie aligner with the Hadoop-based Heap SNP caller to improve SNP detection in genome sequences. The proposed workflow is validated on benchmark datasets obtained from publicly available web portals, e.g., NCBI and DDBJ DRA. Extensive experiments were performed, and the results are compared with the Bowtie and BWA aligners in the alignment phase, and with GATK, FaSD, SparkGA, Halvade, and Heap in the SNP-calling phase. The analysis shows that the proposed workflow outperforms existing frameworks, e.g., GATK, FaSD, Heap integrated with the BWA and Bowtie aligners, SparkGA, and Halvade. The proposed framework achieved a 22.46% higher F-score and a consistent average accuracy of 99.80%, corresponding to a 0.21% higher mean accuracy than the compared frameworks. Moreover, SNP mining was performed to identify specific regions in the genome sequences. All frameworks were implemented with the default memory-management configuration, and the observations show that all workflows have approximately the same memory requirement. In the future, we intend to visualize the mined SNPs graphically for user-friendly interaction and to analyze and optimize the memory requirements.
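To make the two-stage design concrete, the sketch below shows how an alignment-then-SNP-calling driver could be wired together. It is a minimal illustration under stated assumptions, not the paper's implementation: the bowtie2-style command-line flags, the heap_snp_caller.jar name, and the HDFS paths are hypothetical placeholders, and the actual Bowtie/Heap invocations and Hadoop job configuration may differ.

```python
# Sketch of an alignment + Hadoop-based SNP-calling driver (all tool
# invocations and paths below are illustrative assumptions).
import subprocess


def align_reads(index_prefix: str, reads_1: str, reads_2: str,
                out_sam: str, threads: int = 8) -> None:
    """Align paired-end FASTQ reads against a prebuilt index (bowtie2-style flags assumed)."""
    subprocess.run(
        ["bowtie2", "-p", str(threads), "-x", index_prefix,
         "-1", reads_1, "-2", reads_2, "-S", out_sam],
        check=True,
    )


def call_snps_on_hadoop(alignment_hdfs_path: str, output_hdfs_path: str) -> None:
    """Submit a SNP-calling job to a Hadoop cluster (jar name is a placeholder)."""
    subprocess.run(
        ["hadoop", "jar", "heap_snp_caller.jar",
         alignment_hdfs_path, output_hdfs_path],
        check=True,
    )


if __name__ == "__main__":
    align_reads("ref_index/genome", "sample_1.fastq", "sample_2.fastq", "sample.sam")
    # Stage the alignment into HDFS before submitting the distributed SNP-calling job.
    subprocess.run(["hdfs", "dfs", "-put", "-f", "sample.sam", "/data/sample.sam"], check=True)
    call_snps_on_hadoop("/data/sample.sam", "/data/snps_out")
```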

Highlights

  • The knowledge base of biological data can be collected from natural life, scientific experiments, and research archives

  • Each workflow experiment was executed 100 times and the average time in seconds was computed for the sample datasets (a minimal timing sketch follows this list); results on the real clusters are reported in minutes for clearer visualization and ease of understanding

  • For scalability analysis, all workflows were evaluated on real compute clusters of different configurations: 8 compute nodes with 32 cores (116 GHz aggregate processing power) and 112 GB of memory, 16 compute nodes with 64 cores (237.60 GHz) and 304 GB of memory, and 32 compute nodes with 128 cores (471.2 GHz) and 560 GB of memory
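As a rough illustration of the timing procedure stated above (the 100-run repetition and per-run averaging are restated from the highlights; the workflow command itself is a hypothetical placeholder), a minimal harness could look like this:

```python
# Minimal timing harness: run a workflow command N times and report the mean
# wall-clock time in seconds. The command is a placeholder, not the paper's
# actual workflow driver.
import subprocess
import time


def mean_runtime(cmd: list[str], repetitions: int = 100) -> float:
    """Return the average wall-clock time in seconds over `repetitions` runs."""
    total = 0.0
    for _ in range(repetitions):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        total += time.perf_counter() - start
    return total / repetitions


if __name__ == "__main__":
    avg = mean_runtime(["./run_workflow.sh", "sample_dataset.fastq"])  # placeholder command
    print(f"Average runtime over 100 runs: {avg:.2f} s")
```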

Introduction

The knowledge base of biological data can be collected from natural life, scientific experiments, and research archives. Classical organism databases are useful where species-specific data are available, as such data have great significance for new discoveries. Biological databases play a significant role in bioinformatics because they provide access to a wide range of biological data across an increasing variety of organisms. Many biological research studies have been conducted and have produced significant genomic data resources, yet it is often stated that these resources have not been fully explored [1]. These data sources also pose statistical problems, e.g., the family-wise error rate (FWER) [2].
