A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models

Diogo Pratas, Jorge M Silva, Armando J Pinho, Morteza Hosseini

doi:10.3390/e21111074

Open Access

https://doi.org/10.3390/e21111074

Journal: Entropy	Publication Date: Nov 2, 2019
Citations: 14	License type: CC BY 4.0

Affiliation: University of Aveiro, University of Helsinki

Abstract

The development of efficient data compressors for DNA sequences is crucial not only for reducing the storage and the bandwidth for transmission, but also for analysis purposes. In particular, the development of improved compression models directly influences the outcome of anthropological and biomedical compression-based methods. In this paper, we describe a new lossless compressor with improved compression capabilities for DNA sequences representing different domains and kingdoms. The reference-free method uses a competitive prediction model to estimate, for each symbol, the best class of models to be used before applying arithmetic encoding. There are two classes of models: weighted context models (including substitutional tolerant context models) and weighted stochastic repeat models. Both classes of models use specific sub-programs to handle inverted repeats efficiently. The results show that the proposed method attains a higher compression ratio than state-of-the-art approaches, on a balanced and diverse benchmark, using a competitive level of computational resources. An efficient implementation of the method is publicly available, under the GPLv3 license.
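The idea of choosing, symbol by symbol, which model feeds the arithmetic encoder can be illustrated with a small sketch. This is not the authors' implementation: the class `ContextModel`, the function `competitive_code_length`, the smoothing parameter `alpha`, and the selection rule (use whichever model was cheapest on the previous symbol) are all assumptions made for illustration. Two plain adaptive context models of different orders stand in for the paper's two model classes, and the sketch reports only the ideal code length that an arithmetic encoder would approach, without performing the encoding itself.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ContextModel:
    """Adaptive order-k finite-context model over {A, C, G, T} (illustrative)."""

    def __init__(self, order, alpha=1.0):
        self.order = order
        self.alpha = alpha  # additive smoothing; the value is an assumption
        self.counts = defaultdict(lambda: [0, 0, 0, 0])

    def probs(self, context):
        """Smoothed probability of each base following `context`."""
        c = self.counts[context]
        total = sum(c) + 4 * self.alpha
        return [(n + self.alpha) / total for n in c]

    def update(self, context, symbol_index):
        """Record that `symbol_index` followed `context`."""
        self.counts[context][symbol_index] += 1


def competitive_code_length(seq, models):
    """Ideal code length in bits under a simplified competitive selection.

    For each symbol, the model charged is the one that was cheapest on the
    previous symbol. The choice depends only on already-coded symbols, so a
    decoder could make the same choice.
    """
    best = 0  # index of the model selected for the current symbol
    total_bits = 0.0
    for i, base in enumerate(seq):
        s = ALPHABET.index(base)
        costs = []
        for m in models:
            ctx = seq[max(0, i - m.order):i]
            costs.append(-math.log2(m.probs(ctx)[s]))
            m.update(ctx, s)  # every model keeps learning, selected or not
        total_bits += costs[best]
        best = costs.index(min(costs))
    return total_bits


if __name__ == "__main__":
    seq = "ACGT" * 200
    bits = competitive_code_length(seq, [ContextModel(2), ContextModel(8)])
    print(f"{bits / len(seq):.3f} bits per base")
```

On a highly repetitive input such as the one in the usage example, the cost per base falls well below the 2 bits of a fixed two-bit encoding, because both models quickly learn the repeating pattern.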

Highlights

The arrival of high-throughput DNA sequencing technology has created a deluge of biological data [1].
We propose a new algorithm (Jarvis) that uses a competitive prediction based on two different classes of models: weighted context models and weighted stochastic repeat models.
We describe in detail the weighted context models, the weighted stochastic repeat models, the competitive prediction model, and the implementation of the algorithm.

Summary

Introduction

The arrival of high-throughput DNA sequencing technology has created a deluge of biological data [1]. There are many file formats for representing genomic data (for example, FASTA, FASTQ, BAM/SAM, VCF/BCF, and MSA), and many data compressors for these formats [8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]. All of these file formats have the genomic sequence part in common, in different phases or using different representations. The ultimate aim of genomics, before downstream analysis, is to assemble high-quality genomic sequences, which enable high-quality analysis and consistent scientific findings.

Methods

Results

Conclusion


