Abstract

Multiple sequence alignment (MSA) is an integral part of molecular biology, but handling a massive number of large sequences is still a bottleneck for most state-of-the-art software tools. Knowledge-driven algorithms that exploit features of the input sequences, such as the high similarity typical of DNA sequences, can improve the efficiency of DNA MSA and thereby assist phylogenetic tree construction, comparative genomics and related tasks. This article showcases the benefit of utilizing similarity features while performing the alignment. The algorithm, SPARK-MSNA, uses a suffix tree to identify common substrings and a modified Needleman-Wunsch algorithm for pairwise alignments. To improve the efficiency of pairwise alignments, a knowledge base is created and supervised learning with a nearest neighbor algorithm is used to guide the alignment. The algorithm provided linear complexity, O(m), compared to O(m^2). Compared with state-of-the-art algorithms (e.g., HAlign II), SPARK-MSNA provided a 50% improvement in memory utilization when processing human mitochondrial genomes (mt genomes, 100x, 1.1 GB), with better alignment accuracy in terms of average SP score and comparable execution time. The algorithm is implemented on the big data framework Apache Spark to improve scalability. The source code and test data are available at: https://sourceforge.net/projects/spark-msna/.
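To illustrate why high similarity helps, the following is a minimal sketch and not the SPARK-MSNA implementation. It uses Python's `difflib.SequenceMatcher` as a stand-in for the suffix tree to find common substrings, then counts how many dynamic-programming cells remain if only the regions between those common substrings are aligned. The function name `dp_cells_with_anchors` and the `min_anchor` threshold are hypothetical, chosen only for this illustration.

```python
from difflib import SequenceMatcher


def dp_cells_with_anchors(a, b, min_anchor=4):
    """Return (full_cells, reduced_cells) for aligning sequences a and b.

    full_cells    : size of the complete pairwise DP table, len(a) * len(b).
    reduced_cells : DP cells left when common substrings of at least
                    min_anchor characters are fixed as anchors and only
                    the gaps between consecutive anchors are aligned.

    Assumption: difflib's matching blocks replace the suffix tree here.
    """
    full_cells = len(a) * len(b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)

    reduced_cells = 0
    prev_a = prev_b = 0  # end of the previous anchor in a and in b
    for block in matcher.get_matching_blocks():
        # Skip matches too short to serve as anchors. The final block
        # always has size 0 and marks the end of both sequences, so it
        # is kept in order to count the trailing gap.
        if 0 < block.size < min_anchor:
            continue
        # Region between the previous anchor and this one.
        reduced_cells += (block.a - prev_a) * (block.b - prev_b)
        prev_a = block.a + block.size
        prev_b = block.b + block.size
    return full_cells, reduced_cells


if __name__ == "__main__":
    seq_a = "ACGTACGGTCATTGCA"
    seq_b = seq_a[:8] + "A" + seq_a[9:]  # one substitution
    print(dp_cells_with_anchors(seq_a, seq_b))
```

For two sequences that differ by a single substitution, the remaining work is a small fraction of the full table. How closely this tracks the O(m) behaviour reported above depends on details the abstract does not give.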

Highlights

  • Sequence alignment is used in bioinformatics to identify the degree of similarity between biological sequences (DNA, RNA or protein) and to understand the functional, structural and evolutionary relationships between them

  • Multiple sequence alignment (MSA) algorithms supporting massive genome sequences are still evolving, and there is a lack of benchmark datasets for large-scale DNA MSA algorithms

  • We have focused on improving the efficiency of MSA involving large DNA sequences by utilizing their similarity and by improving performance with a learning layer and parallel execution

Summary

Introduction

Sequence alignment is used in bioinformatics to identify the degree of similarity between biological sequences (DNA, RNA or protein) and to understand the functional, structural and evolutionary relationships between them. The Needleman-Wunsch (NW) algorithm [1] was one of the first applications of dynamic programming in bioinformatics. It is an optimal sequence alignment algorithm, with a tradeoff in computational time and space. HAlign uses a centre star strategy along with a trie tree data structure to improve performance. The Spark version of this algorithm, HAlign II [9], designed to support a large volume of sequences, reported promising results for similar DNA/RNA sequence alignment. Key characteristics of the proposed algorithm include: (a) a suffix tree data structure for storing input sequences and identifying common substrings between sequences, (b) a knowledge base and nearest neighbor learning layer to guide the pairwise alignment, (c) a modified Needleman-Wunsch algorithm to perform pairwise alignments at each stage in order to reduce the memory and execution time of alignments, and (d) parallelization using the MapReduce method for suffix tree construction and pairwise alignment to further improve the execution time.
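For readers unfamiliar with the Needleman-Wunsch algorithm mentioned above, here is a minimal sketch of the textbook version in Python. It is not the modified variant used by the proposed algorithm, and the scoring values (match = 1, mismatch = -1, gap = -2) are assumptions chosen for illustration. The table has (n + 1) x (m + 1) cells, which is the time and space tradeoff the text refers to.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b (textbook Needleman-Wunsch).

    Returns (score, aligned_a, aligned_b). Scoring values are
    illustrative assumptions, not taken from the article.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

Both loops visit every cell of the table, so time and memory grow with the product of the two sequence lengths. That cost is what the suffix tree and learning layer described above aim to avoid for highly similar DNA sequences.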
