SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning

V. Vineetha, Achuthsankar S. Nair, C. L. Biji

doi:10.1038/s41598-019-42966-5

Open Access

https://doi.org/10.1038/s41598-019-42966-5

Journal: Scientific Reports
Publication Date: Apr 29, 2019
Citations: 7
License type: open-access

Affiliation: University of Kerala

Abstract

Multiple sequence alignment (MSA) is an integral part of molecular biology, but handling a massive number of large sequences is still a bottleneck for most state-of-the-art software tools. Knowledge-driven algorithms that exploit features of the input sequences, such as the high similarity typical of DNA sequences, can improve the efficiency of DNA MSA in support of phylogenetic tree construction, comparative genomics, and related tasks. This article showcases the benefit of utilizing similarity features while performing the alignment. The algorithm uses a suffix tree to identify common substrings and a modified Needleman-Wunsch algorithm for pairwise alignments. To improve the efficiency of the pairwise alignments, a knowledge base is created and supervised learning with a nearest neighbor algorithm is used to guide the alignment. The algorithm achieved linear complexity O(m), compared to O(m^2). Compared with state-of-the-art algorithms (e.g., HAlign II), SPARK-MSNA provided a 50% improvement in memory utilization when processing human mitochondrial genomes (mt genomes, 100x, 1.1 GB), with better alignment accuracy in terms of average SP score and comparable execution time. The algorithm is implemented on the big data framework Apache Spark to improve scalability. The source code and test data are available at: https://sourceforge.net/projects/spark-msna/.
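
The abstract reports accuracy as an average SP (sum-of-pairs) score. As a rough illustration of what a sum-of-pairs score measures, the sketch below scores every pair of rows in every column of an alignment and adds the results. The scoring values (match 1, mismatch -1, gap -2, gap against gap 0) and the function names `pair_score` and `sp_score` are assumptions chosen for this example; the exact scheme and averaging used by the authors are not given here.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned characters (assumed scoring values)."""
    if x == "-" and y == "-":
        return 0  # two gaps facing each other carry no information
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sp_score(alignment, **scoring):
    """Sum-of-pairs score of an alignment given as equal-length strings."""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        # Every unordered pair of rows contributes once per column.
        for x, y in combinations(column, 2):
            total += pair_score(x, y, **scoring)
    return total


if __name__ == "__main__":
    rows = ["ACGT", "A-GT", "ACGA"]
    print(sp_score(rows))
```

A higher score means more agreeing columns and fewer gaps; comparing tools with such a score only makes sense when all of them are evaluated with the same scoring values.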

Highlights

Sequence alignment is used in bioinformatics to identify the degree of similarity between biological sequences (DNA, RNA or protein) and to understand the functional, structural and evolutionary relationships between them
Multiple sequence alignment (MSA) algorithms that support massive genome sequences are still evolving, and benchmark datasets for large-scale DNA MSA algorithms are lacking
We focused on improving the efficiency of MSA for large DNA sequences by exploiting their similarity, and on improving performance with a learning layer and parallel execution
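
To illustrate why high similarity helps, the sketch below finds long common blocks between two similar sequences and lists the short regions left over, which are the only parts that still need dynamic programming. This is a simplified stand-in: it uses Python's `difflib.SequenceMatcher` rather than the suffix tree described in the paper, and the helper names `common_anchors` and `unaligned_regions` and the `min_len` threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher


def common_anchors(a, b, min_len=4):
    """Return (start_in_a, start_in_b, length) for long common blocks."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_len
    ]


def unaligned_regions(a, b, anchors):
    """Return the substring pairs lying between consecutive anchors."""
    regions = []
    pos_a = pos_b = 0
    for start_a, start_b, size in anchors:
        if start_a > pos_a or start_b > pos_b:
            regions.append((a[pos_a:start_a], b[pos_b:start_b]))
        pos_a, pos_b = start_a + size, start_b + size
    if pos_a < len(a) or pos_b < len(b):
        regions.append((a[pos_a:], b[pos_b:]))
    return regions


if __name__ == "__main__":
    seq_a = "ACGTACGTTTGACCA"
    seq_b = "ACGTACGATTGACCA"  # differs from seq_a at one position
    anchors = common_anchors(seq_a, seq_b)
    print(anchors)
    print(unaligned_regions(seq_a, seq_b, anchors))
```

For highly similar inputs the anchors cover most of both sequences, so the pairwise aligner only has to process the small leftover regions.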

Summary

Introduction

Sequence alignment is used in bioinformatics to identify the degree of similarity between biological sequences (DNA, RNA or protein) and to understand the functional, structural and evolutionary relationships between them. The Needleman-Wunsch (NW) algorithm [1] was one of the first implementations of dynamic programming in bioinformatics. It is an optimal sequence alignment algorithm, with a tradeoff in computational time and space.

The algorithm uses a centre star strategy along with a trie tree data structure to improve performance. The Spark version of this algorithm, HAlign II [9], built to support large volumes of sequences, reported promising results for similar DNA/RNA sequence alignment.

Key characteristics of the proposed algorithm include: (a) a suffix tree data structure for storing input sequences and identifying common substrings between sequences; (b) a knowledge base and nearest neighbor learning layer to guide the pairwise alignment; (c) a modified Needleman-Wunsch algorithm to perform pairwise alignments at each stage, in order to reduce the memory and execution time of alignments; and (d) parallelization using the MapReduce method for suffix tree construction and pairwise alignment to further improve the execution time.
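
For reference, the textbook Needleman-Wunsch recurrence can be sketched as follows. This is the standard quadratic-time, quadratic-space version, not the modified variant proposed in the paper, and the scoring values (match 1, mismatch -1, gap -1) are assumptions for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The full table has (n + 1) x (m + 1) cells, which is the time and space tradeoff mentioned above and the cost that restricting dynamic programming to short regions between common substrings is meant to avoid.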
