An efficient normalized maximum likelihood algorithm for DNA sequence compression

Gergely Korodi, Ioan Tabus

doi:10.1145/1055709.1055711

Abstract

This article presents an efficient algorithm for DNA sequence compression that achieves the best compression ratios reported on a test set commonly used for evaluating DNA compression programs. The algorithm introduces many refinements to a compression method that combines (1) encoding by a simple normalized maximum likelihood (NML) model for discrete regression, through reference to preceding approximate matching blocks; (2) encoding by first-order context coding; and (3) representing strings in clear. Together these make efficient use of the redundancy sources in DNA data while keeping execution times fast. One of the main algorithmic features is the constraint that matching blocks include reasonably long contiguous matches, which not only significantly reduces the search time but can also be used to modify the NML model so that it exploits the constraint to obtain smaller code lengths. The algorithm handles the changing statistics of DNA data adaptively, and by predictively encoding the matching pointers it succeeds in compressing long approximate matches. In addition to a comparison with previous DNA encoding methods, we present compression results for the recently published human genome data.
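
To make the three encoding options named in the abstract concrete, the following minimal Python sketch compares ideal code lengths for one block under each of them. It is an illustration under stated assumptions, not the authors' implementation. The abstract gives no formulas, so the NML part assumes a symmetric substitution model (each base copies the reference base with probability theta, otherwise it is one of the other three bases with equal probability), the context coder assumes add-one counts, and all function names are hypothetical. Pointer costs, the contiguous-match constraint, and the adaptive refinements mentioned in the abstract are left out.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def clear_code_length(block: str) -> float:
    """Bits needed to store the block 'in clear': 2 bits per base."""
    return 2.0 * len(block)


def order1_code_length(block: str, prev: str = "A") -> float:
    """Ideal code length (bits) of an adaptive first-order context model.

    Each base is predicted from the base before it. The add-one (Laplace)
    counts and the initial context `prev` are assumptions of this sketch.
    """
    counts = defaultdict(lambda: {b: 1 for b in ALPHABET})
    bits = 0.0
    for base in block:
        ctx = counts[prev]
        bits -= math.log2(ctx[base] / sum(ctx.values()))
        ctx[base] += 1  # update the model after coding the symbol
        prev = base
    return bits


def nml_code_length(block: str, reference: str) -> float:
    """NML code length (bits) of `block` given an equally long `reference`.

    Assumed model: P(y_t = x_t) = theta, and each of the other three bases
    has probability (1 - theta) / 3. Intended for short, non-empty blocks.
    """
    n = len(block)
    if n == 0 or len(reference) != n:
        raise ValueError("block and reference must be non-empty and equally long")
    matches = sum(a == b for a, b in zip(block, reference))

    def max_likelihood(k: int) -> float:
        # Likelihood at the ML estimate theta = k / n (Python gives 0.0 ** 0 == 1.0).
        return (k / n) ** k * ((n - k) / (3 * n)) ** (n - k)

    # Sum the maximized likelihood over every possible block: there are
    # C(n, k) * 3^(n - k) blocks that agree with the reference in exactly k places.
    normalizer = sum(
        math.comb(n, k) * 3 ** (n - k) * max_likelihood(k) for k in range(n + 1)
    )
    return -math.log2(max_likelihood(matches) / normalizer)


if __name__ == "__main__":
    reference = "ACGTTGCAACGT"
    block = "ACGTTGCTACGT"  # one substitution relative to the reference
    print("clear  :", clear_code_length(block))
    print("order-1:", order1_code_length(block))
    print("NML    :", nml_code_length(block, reference))
```

An encoder in this spirit would pick, for each block, whichever option yields the shortest code and signal that choice to the decoder. How the published algorithm makes and encodes that choice is not stated in the abstract.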

Journal: ACM Transactions on Information Systems	Publication Date: Jan 1, 2005
Citations: 95

Similar Papers

A hybrid particle swarm optimization based memetic algorithm for DNA sequence compression
Li Tan ... Jifeng Sun
Soft Computing | VOL. 19
22 Jun 2014

DNA sequence compression using the normalized maximum likelihood model for discrete regression
I. Tabus ... G. Korodi
25 Mar 2003

Classification and feature gene selection using the normalized maximum likelihood model for discrete regression
Ioan Tabus ... Jaakko Astola
Signal Processing | VOL. 83
13 Dec 2002

Efficient Storage of Massive Biological Sequences in Compact Form
Ashutosh Gupta ... Vinay Rishiwal
01 Jan 2009
