A new algorithm for "the LCS problem" with application in compressing genome resequencing data.

Richard Beal, Aliya Farheen, Donald Adjeroh, Tazin Afrin

doi:10.1186/s12864-016-2793-0

Abstract

Background: The longest common subsequence (LCS) problem is a classical problem in computer science, and forms the basis of the current best-performing reference-based compression schemes for genome resequencing data.

Methods: First, we present a new algorithm for the LCS problem. Using the generalized suffix tree, we identify the common substrings shared between the two input sequences. Using the maximal common substrings, we construct a directed acyclic graph (DAG), based on which we determine the LCS as the longest path in the DAG. Then, we introduce an LCS-motivated reference-based compression scheme using the components of the LCS, rather than the LCS itself.

Results: Our basic scheme compressed the Homo sapiens genome (with an original size of 3,080,436,051 bytes) to 15,460,478 bytes. An improvement on the basic method further reduced this to 8,556,708 bytes, or an overall compression ratio of 360. This can be compared to the previous state-of-the-art compression ratios of 157 (Wang and Zhang, 2011) and 171 (Pinho, Pratas, and Garcia, 2011).

Conclusion: We propose a new algorithm to address the longest common subsequence problem. Motivated by our LCS algorithm, we introduce a new reference-based compression scheme for genome resequencing data. Comparative results against state-of-the-art reference-based compression algorithms demonstrate the performance of the proposed method.
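The pipeline outlined in the abstract (common substrings, then a DAG, then a longest path) can be illustrated with a small sketch. This is not the authors' implementation: a brute-force scan stands in for the generalized suffix tree, the names `maximal_common_substrings` and `chain_common_substrings` are hypothetical, and only whole, non-overlapping blocks are chained. Under that simplification the result is always a common subsequence, but it is not guaranteed to reach the true LCS length when the optimum needs only part of a block.

```python
from typing import List, Tuple

Block = Tuple[int, int, int]  # (start in s1, start in s2, length)


def maximal_common_substrings(s1: str, s2: str) -> List[Block]:
    """Brute-force stand-in for the suffix-tree step (assumption):
    every exact match that cannot be extended left or right."""
    blocks: List[Block] = []
    for i in range(len(s1)):
        for j in range(len(s2)):
            if s1[i] != s2[j]:
                continue
            # Skip positions that are the interior of a longer match.
            if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
                continue
            length = 0
            while (i + length < len(s1) and j + length < len(s2)
                   and s1[i + length] == s2[j + length]):
                length += 1
            blocks.append((i, j, length))
    return blocks


def chain_common_substrings(s1: str, s2: str) -> str:
    """Longest (weighted) path in the DAG whose nodes are blocks and whose
    edges a -> b exist when a ends before b starts in BOTH strings."""
    blocks = sorted(maximal_common_substrings(s1, s2))
    if not blocks:
        return ""
    # Sorting by start in s1 is a valid topological order, because an edge
    # a -> b implies a starts strictly before b in s1.
    best = [length for (_, _, length) in blocks]
    prev = [-1] * len(blocks)
    for b, (ib, jb, lb) in enumerate(blocks):
        for a in range(b):
            ia, ja, la = blocks[a]
            if ia + la <= ib and ja + la <= jb and best[a] + lb > best[b]:
                best[b] = best[a] + lb
                prev[b] = a
    # Walk back from the heaviest end node to recover the path.
    node = max(range(len(blocks)), key=lambda x: best[x])
    path = []
    while node != -1:
        path.append(node)
        node = prev[node]
    return "".join(s1[i:i + l] for (i, _, l) in (blocks[n] for n in reversed(path)))


if __name__ == "__main__":
    print(chain_common_substrings("ABCD", "ABXCD"))
```

The quadratic scans are only suitable for short strings; the point of the suffix tree in the abstract is to find the common substrings far more efficiently.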

Highlights

The longest common subsequence (LCS) problem is a classical problem in computer science, and forms the basis of the current best-performing reference-based compression schemes for genome resequencing data
Recall that the parameter k is a type of threshold used by our compression scheme to determine whether it is more beneficial to encode a symbol verbatim or to encode a common substring (CSS) as a triple
Our compression algorithm works on the longest previous factor (LPF) array in a left-to-right fashion, selecting the leftmost CSS, say T[i ... i + l − 1] of length LPF[i] = l, and determining whether to encode that CSS as a triple [and next consider the CSS T[i + l ... i + l + LPF[i + l] − 1] of length LPF[i + l]], or to encode the first symbol T[i] verbatim [and next consider the CSS T[i + 1 ... i + LPF[i + 1]] of length LPF[i + 1]]
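The left-to-right decision described above can be sketched as follows. Several details are assumptions made for illustration only: the decision rule is taken to be "emit a triple when LPF[i] >= k, otherwise emit T[i] verbatim", the triple layout `("copy", source_position, length)` is a placeholder because this excerpt does not specify the triple's fields, the LPF array is computed by brute force, and LPF is taken over a single string T (in a reference-based setting T would presumably be the reference followed by the target). The function names are hypothetical.

```python
from typing import List, Tuple, Union

Token = Union[str, Tuple[str, int, int]]  # literal symbol or ("copy", src, length)


def lpf_with_positions(t: str) -> Tuple[List[int], List[int]]:
    """Brute-force LPF: for each i, the length of the longest factor starting
    at i that also starts at some earlier position j < i, and one such j."""
    n = len(t)
    lpf = [0] * n
    pos = [-1] * n
    for i in range(n):
        for j in range(i):
            length = 0
            while i + length < n and t[j + length] == t[i + length]:
                length += 1
            if length > lpf[i]:
                lpf[i], pos[i] = length, j
    return lpf, pos


def encode(t: str, k: int) -> List[Token]:
    """Greedy scan: triple if the CSS at i is long enough, else one literal."""
    lpf, pos = lpf_with_positions(t)
    tokens: List[Token] = []
    i = 0
    while i < len(t):
        length = lpf[i]
        if length > 0 and length >= k:   # assumed threshold rule
            tokens.append(("copy", pos[i], length))
            i += length                  # next consider the CSS at i + l
        else:
            tokens.append(t[i])
            i += 1                       # next consider the CSS at i + 1
    return tokens


def decode(tokens: List[Token]) -> str:
    out: List[str] = []
    for tok in tokens:
        if isinstance(tok, str):
            out.append(tok)
        else:
            _, src, length = tok
            for d in range(length):      # symbol-by-symbol so overlaps work
                out.append(out[src + d])
    return "".join(out)


if __name__ == "__main__":
    text = "ACGTACGTAC"
    print(encode(text, 3))
    print(decode(encode(text, 3)) == text)
```

Raising k pushes the encoder toward literals and lowering it toward triples, which is the trade-off the threshold is meant to control.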

Summary

Introduction

The longest common subsequence (LCS) problem is a classical problem in computer science, and forms the basis of the current best-performing reference-based compression schemes for genome resequencing data. An important approach to measuring the similarity between two strings S1 and S2 is computing their longest common subsequence, i.e., the longest ordered list of symbols common to S1 and S2; the symbols must appear in the same order in both strings but need not be contiguous. Biological applications of the LCS and similarity measurement are varied: sequence alignment [5] in comparative genomics [6], phylogenetic construction and analysis, rapid search in huge biological sequences [7], compression and efficient storage of the rapidly expanding genomic data sets [8, 9], and re-sequencing a set of strings given a target string [10], an important step in efficient genome assembly.
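To make the definition concrete, here is the textbook dynamic-programming solution for two strings. It is a baseline for illustration, not the suffix-tree and DAG algorithm proposed in the paper, and the function name `lcs` is simply a label chosen for this sketch.

```python
def lcs(s1: str, s2: str) -> str:
    """Return one longest common subsequence of s1 and s2.

    Classic O(len(s1) * len(s2)) time and space dynamic programming.
    """
    m, n = len(s1), len(s2)
    # dp[i][j] = LCS length of the prefixes s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if s1[i] == s2[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])

    # Backtrack from the bottom-right corner to recover the symbols.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(lcs("AGGTAB", "GXTXAYB"))
```

The quadratic table is what makes this approach impractical for genome-scale inputs, which is part of the motivation for alternative formulations.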

Journal: BMC Genomics	Publication Date: Aug 1, 2016
Citations: 15	License type: cc-by

Similar Papers

A new algorithm for “the LCS problem” with application in compressing genome resequencing data
Richard Beal ... Don Adjeroh
01 Nov 2015

Efficient CGM-based parallel algorithms for the longest common subsequence problem with multiple substring-exclusion constraints
Vianney Kengne Tchendji ... Jean Frédéric Myoupo
Parallel Computing | VOL. 91
30 Nov 2019

Anytime algorithms for the longest common palindromic subsequence problem
Marko Djukanovic ... Christian Blum
Computers and Operations Research | VOL. 114
14 Oct 2019

Algorithms for computing variants of the longest common subsequence problem
Costas S Iliopoulos ... M Sohel Rahman
Theoretical Computer Science | VOL. 395
01 May 2008
