A Novel Efficient Graph Model for the Multiple Longest Common Subsequences (MLCS) Problem.

Zhan Peng, Yuping Wang

doi:10.3389/fgene.2017.00104

Abstract

Searching for the Multiple Longest Common Subsequences (MLCS) of multiple sequences is a classical NP-hard problem with many applications. One of the most effective exact approaches to the MLCS problem is based on the dominant point graph, which is a kind of directed acyclic graph (DAG). However, the time and space efficiency of the leading dominant point graph based approaches is still unsatisfactory: constructing the dominant point graph used by these approaches requires a huge amount of time and space, which hinders their application to large numbers of long sequences. To address this issue, we propose a new time and space efficient graph model for the MLCS problem, called the Leveled-DAG. During construction, the Leveled-DAG promptly eliminates all nodes in the graph that cannot contribute to an MLCS. At any moment, only the current level and some previously generated nodes need to be kept in memory, which greatly reduces memory consumption. In addition, the final graph contains only one node, in which all of the wanted MLCS are saved, so no additional search for the MLCS is needed. Experiments are conducted on real biological sequences of varying number and length, and the proposed algorithm is compared with three state-of-the-art algorithms. The experimental results show that the Leveled-DAG approach needs less time and space than the compared algorithms, especially on large numbers of long sequences.

Highlights

Measuring the similarity of biological sequences is a fundamental problem in bioinformatics, with applications such as cancer diagnosis (Aravanis et al., 2017) and detection of the common origin of species (Zvelebil and Baum, 2007).
The Multiple Longest Common Subsequences (MLCS) problem looks for the longest common subsequences of many sequences.
One of the most important applications of MLCS algorithms in bioinformatics is multiple sequence alignment (MSA), an essential technique for arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

Summary

Introduction

Measuring the similarity of biological sequences is a fundamental problem in bioinformatics, with applications such as cancer diagnosis (Aravanis et al., 2017) and detection of the common origin of species (Zvelebil and Baum, 2007). One of the most important ways to measure the similarity of sequences is to find their Longest Common Subsequences (LCS), which for an arbitrary number of sequences has been proved to be an NP-hard problem (Maier, 1978). Looking for the longest common subsequences of more than two sequences is called the Multiple Longest Common Subsequences (MLCS) problem. One of the most important applications of MLCS algorithms in bioinformatics is multiple sequence alignment (MSA), an essential technique for arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Although algorithms have been proposed for these applications, they are not efficient enough for large numbers of long sequences because of their high time and space overhead.
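To make the problem statement concrete, the following is a minimal Python sketch that enumerates all longest common subsequences of a small set of sequences by expanding match points one level at a time. It is not the Leveled-DAG algorithm of this paper: it performs none of the node elimination described in the abstract and takes exponential time in the worst case. The function name `mlcs_by_levels` and its data layout are assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple


def mlcs_by_levels(seqs: List[str]) -> Set[str]:
    """Return every longest common subsequence of seqs (naive sketch)."""
    if not seqs or any(not s for s in seqs):
        return {""}

    # Only characters present in every sequence can appear in a common subsequence.
    alphabet = set(seqs[0]).intersection(*map(set, seqs[1:]))

    # A match point holds one index per sequence. Each level maps a match
    # point to the common subsequences of the current length whose leftmost
    # embedding ends at that point. Only the current level is kept.
    source: Tuple[int, ...] = tuple(-1 for _ in seqs)
    level: Dict[Tuple[int, ...], Set[str]] = {source: {""}}
    best: Set[str] = {""}

    while level:
        # The last non-empty level holds the longest common subsequences.
        best = set().union(*level.values())
        next_level: Dict[Tuple[int, ...], Set[str]] = {}
        for point, prefixes in level.items():
            for ch in alphabet:
                # Next occurrence of ch strictly after the current index in each sequence.
                succ = tuple(s.find(ch, i + 1) for s, i in zip(seqs, point))
                if -1 in succ:
                    continue  # ch is missing from the remainder of some sequence
                next_level.setdefault(succ, set()).update(p + ch for p in prefixes)
        level = next_level

    return best
```

For example, `mlcs_by_levels(["ABCBDAB", "BDCABA"])` returns the three common subsequences of length 4, `BCBA`, `BCAB`, and `BDAB`. Each level stores whole strings at every match point, which is affordable only for short inputs; it is this kind of overhead that motivates more economical graph models.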

Results

Discussion

Conclusion