StreamAligner: a streaming based sequence aligner on Apache Spark

Sanjay Rathee, Arti Kashyap

doi:10.1186/s40537-018-0114-y

https://doi.org/10.1186/s40537-018-0114-y

Journal: Journal of Big Data
Publication Date: Feb 27, 2018
Citations: 5
License type: open-access

Affiliation: Indian Institute of Technology Mandi

Abstract

Next-generation sequencing technologies are generating huge amounts of genetic data that need to be mapped and analyzed. Single-machine sequence alignment tools are increasingly unable to keep pace with this growth. Distributed computing platforms based on the MapReduce paradigm, which use thousands of commodity machines to process and analyze huge datasets, are therefore emerging as the best solution for growing genomics data. Many MapReduce-based sequence alignment tools, such as CloudBurst, CloudAligner, Halvade, and SparkBWA, have been proposed in recent years. These aligners are fast and efficient: they can align billions of reads (stored as FASTA or FASTQ files) to a reference genome in a few minutes. With technology advancing rapidly, however, analyzing huge genome data quickly is no longer enough; data must be analyzed in real time to automate the alignment process. We therefore propose StreamAligner, a MapReduce-based sequence alignment tool implemented on the Spark Streaming engine. StreamAligner aligns a stream of reads to a reference genome in real time, so it can be used to automate the sequencing and alignment process. It aligns reads using a suffix array index, which is built by a distributed index generation algorithm that keeps index generation time low. The index needs to be loaded only once, when StreamAligner is launched; after that it stays in Spark memory and can be reused any number of times without reloading. In contrast, current state-of-the-art sequence aligners either generate the index (hash-index-based aligners) or load it (sorted-index-based aligners) for every task. StreamAligner therefore saves the time otherwise spent generating or loading the index for every task. A working and tested implementation of StreamAligner is available on GitHub for download and use. We tested the effectiveness, efficiency, and scalability of our aligner on various standard and real-life datasets.
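
To make the indexing idea in the abstract concrete, the following is a minimal single-machine sketch in plain Python, not StreamAligner's actual code. It assumes exact matching only, builds the suffix array with a naive sort rather than the distributed algorithm the paper describes, and stands in for Spark Streaming micro-batches with an ordinary iterable of read batches. The names `build_suffix_array`, `find_exact_matches`, and `align_stream` are hypothetical and chosen for illustration.

```python
from typing import Dict, Iterable, Iterator, List


def build_suffix_array(reference: str) -> List[int]:
    """Return suffix start positions sorted by suffix (naive, for illustration)."""
    return sorted(range(len(reference)), key=lambda i: reference[i:])


def find_exact_matches(reference: str, suffix_array: List[int], read: str) -> List[int]:
    """Binary-search the suffix array for every position where `read` occurs."""
    m = len(read)

    def prefix(rank: int) -> str:
        # First m characters of the suffix at this rank in sorted order.
        start = suffix_array[rank]
        return reference[start:start + m]

    # Lower bound: first rank whose prefix is >= read.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < read:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first rank whose prefix is > read.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= read:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


def align_stream(reference: str,
                 batches: Iterable[List[str]]) -> Iterator[Dict[str, List[int]]]:
    """Build the index once, then reuse it for every incoming batch of reads."""
    index = build_suffix_array(reference)  # done once, kept in memory
    for batch in batches:
        yield {read: find_exact_matches(reference, index, read) for read in batch}


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"
    incoming = [["ACGT", "GGG"], ["TAGC"]]  # two hypothetical micro-batches
    for result in align_stream(reference, incoming):
        print(result)
```

The point of the sketch is the shape of `align_stream`: the index is constructed before the loop and shared by every batch, which mirrors the abstract's claim that the index is loaded once and then reused without reloading. A real deployment would also need inexact matching and a distributed, memory-resident index, neither of which is shown here.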

Highlights

The use of the latest computer technology to manage biological information has risen rapidly over the last decade
We propose a MapReduce-based sequence alignment tool, StreamAligner, which is implemented on the Spark Streaming engine
Cluster and dataset: We evaluated the performance of StreamAligner on a cluster of five nodes, each with 32 cores and 64 GB of RAM

Summary

Introduction

The use of the latest computer technology to manage biological information has risen rapidly over the last decade. Sequence alignment lies at the heart of bioinformatics and has attracted considerable attention from researchers. It is a way to identify regions of similarity between two sequences of genome data. Sequence alignment has various applications, such as identifying homologous proteins, analyzing gene expression, and mapping variations between individuals. In his TEDx talk, Riccardo Sabatini [1] showed how he and his colleagues are able to read the genome and build a human from this information. They identified the sequences responsible for differences between humans and used this information to reconstruct a human face from a person's DNA.

Methods

Results

Conclusion