A new algorithm for “the LCS problem” with application in compressing genome resequencing data

Richard Beal, Aliya Farheen, Tazin Afrin, Don Adjeroh

doi:10.1109/bibm.2015.7359657

Abstract

The longest common subsequence (LCS) problem is a classical problem in computer science, and forms the basis of the current best-performing reference-based compression schemes for genome resequencing data. First, we present a new algorithm for the LCS problem. Then, we introduce an LCS-motivated reference-based compression scheme using the components of the LCS, rather than the LCS itself. For the Homo sapiens genome (original size 3,080,436,051 bytes), our proposed scheme compressed the genome to 5,267,656 bytes. This can be compared with the previous best results of 19,666,791 bytes (Wang and Zhang, 2011) and 17,971,030 bytes (Pinho, Pratas, and Garcia, 2011). Thus, our compressed output is about 3.73 and 3.41 times smaller, respectively, than the output of these state-of-the-art reference-based compression algorithms.
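The ratios quoted in the abstract can be checked directly from the reported byte counts. The short sketch below only redoes that arithmetic; the variable and function names are illustrative and do not come from the paper.

```python
# Sizes in bytes, exactly as reported in the abstract.
original = 3_080_436_051
proposed = 5_267_656
wang_zhang_2011 = 19_666_791
pinho_pratas_garcia_2011 = 17_971_030


def size_ratio(previous: int, ours: int) -> float:
    """How many times larger the previous output is than ours."""
    return previous / ours


print(round(size_ratio(wang_zhang_2011, proposed), 2))           # 3.73
print(round(size_ratio(pinho_pratas_garcia_2011, proposed), 2))  # 3.41
print(round(original / proposed, 1))                             # 584.8
```

The last line is the overall ratio of the original genome size to the compressed size, which the abstract does not state explicitly but which follows from the same figures.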

Highlights

The longest common subsequence (LCS) problem is a classical problem in computer science, and forms the basis of the current best-performing reference-based compression schemes for genome resequencing data
Recall that the parameter k is a threshold used by our compression scheme to determine whether it is more beneficial to encode a symbol verbatim or to encode a common substring (CSS) as a triple.
Our compression algorithm works on the longest previous factor (LPF) array in a left-to-right fashion. It selects the leftmost CSS, say T[i . . . i + l − 1] of length l = LPF[i], and determines whether to encode that CSS as a triple (and next consider the CSS T[i + l . . . i + l + LPF[i + l] − 1] of length LPF[i + l]), or to encode the first symbol T[i] verbatim (and next consider the CSS T[i + 1 . . . i + LPF[i + 1]] of length LPF[i + 1]).
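The left-to-right walk over the LPF array described in the highlights can be sketched as follows. This is a simplified illustration under several assumptions, not the authors' implementation: the LPF array is computed naively in quadratic time, the decision rule is reduced to "emit a triple when LPF[i] >= k", and the triple layout (target position, source position, length) is hypothetical because this text does not specify it. The names `lpf_naive`, `encode`, and `decode` are likewise made up for the example.

```python
def lpf_naive(t: str):
    """LPF[i] = length of the longest factor starting at i that also
    starts at some earlier position j < i (overlaps allowed).
    Naive computation, for illustration only."""
    n = len(t)
    lpf = [0] * n
    src = [-1] * n  # an earlier start position that achieves LPF[i]
    for i in range(n):
        for j in range(i):
            l = 0
            while i + l < n and t[j + l] == t[i + l]:
                l += 1
            if l > lpf[i]:
                lpf[i], src[i] = l, j
    return lpf, src


def encode(t: str, k: int):
    """Greedy left-to-right parse: a triple for long CSSs, else a literal."""
    if k < 1:
        raise ValueError("k must be at least 1")
    lpf, src = lpf_naive(t)
    tokens = []
    i = 0
    while i < len(t):
        l = lpf[i]
        if l >= k:  # assumed rule: a triple pays off only for long matches
            tokens.append((i, src[i], l))  # hypothetical triple layout
            i += l                         # next consider position i + l
        else:
            tokens.append(t[i])            # encode the symbol verbatim
            i += 1                         # next consider position i + 1
    return tokens


def decode(tokens) -> str:
    """Inverse of encode; copies symbol by symbol so overlaps work."""
    buf = []
    for tok in tokens:
        if isinstance(tok, str):
            buf.append(tok)
        else:
            _, s, l = tok
            for d in range(l):
                buf.append(buf[s + d])
    return "".join(buf)


print(encode("ACGTACGTAC", 3))  # ['A', 'C', 'G', 'T', (4, 0, 6)]
```

Raising k makes the parse fall back to literals more often: with k = 7 the same input is emitted entirely as single symbols, since its longest previous factor has length 6.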

Summary

Introduction

The longest common subsequence (LCS) problem is a classical problem in computer science, and forms the basis of the current best-performing reference-based compression schemes for genome resequencing data. The problem is to compute the longest common subsequence (LCS) between two strings S1 and S2, i.e. the longest ordered, not necessarily contiguous, list of symbols common to S1 and S2. Biological applications of the LCS and of similarity measurement are varied: sequence alignment [5] in comparative genomics [6], phylogenetic construction and analysis, rapid search in huge biological sequences [7], compression and efficient storage of the rapidly expanding genomic data sets [8, 9], and re-sequencing a set of strings given a target string [10], an important step in efficient genome assembly.
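To make the definition concrete, here is a minimal sketch of the textbook dynamic-programming solution for the LCS of two strings. It is the classical quadratic-time method, included only to illustrate what an LCS is; it is not the new algorithm proposed in the paper, and the function name `lcs` is chosen for this example.

```python
def lcs(s1: str, s2: str) -> str:
    """Return one longest common subsequence of s1 and s2 (classical DP)."""
    n, m = len(s1), len(s2)
    # dp[i][j] = length of an LCS of the prefixes s1[:i] and s2[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Walk back from dp[n][m] to recover the symbols of one LCS.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("AGGTAB", "GXTXAYB"))  # GTAB
```

The symbols G, T, A, B appear in that order in both inputs but are not contiguous in either, which is the difference between a common subsequence and a common substring.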

Publication Date: Nov 1, 2015
Citations: 15
License type: cc-by

Similar Papers

A new algorithm for "the LCS problem" with application in compressing genome resequencing data
Richard Beal ... Aliya Farheen
BMC Genomics | Vol. 17, Suppl 4
01 Aug 2016

Algorithms for computing variants of the longest common subsequence problem
Costas S Iliopoulos ... M Sohel Rahman
Theoretical Computer Science | Vol. 395
01 May 2008

Computing a Longest Common Palindromic Subsequence
Shihabur Rahman Chowdhury ... Md Mahbubul Hasan
01 Jan 2012

Algorithms for Computing Variants of the Longest Common Subsequence Problem
M Sohel Rahman ... Costas S Iliopoulos
01 Jan 2006
