Abstract
The development of efficient data compressors for DNA sequences is crucial not only for reducing the storage and the bandwidth for transmission, but also for analysis purposes. In particular, the development of improved compression models directly influences the outcome of anthropological and biomedical compression-based methods. In this paper, we describe a new lossless compressor with improved compression capabilities for DNA sequences representing different domains and kingdoms. The reference-free method uses a competitive prediction model to estimate, for each symbol, the best class of models to be used before applying arithmetic encoding. There are two classes of models: weighted context models (including substitutional tolerant context models) and weighted stochastic repeat models. Both classes of models use specific sub-programs to handle inverted repeats efficiently. The results show that the proposed method attains a higher compression ratio than state-of-the-art approaches, on a balanced and diverse benchmark, using a competitive level of computational resources. An efficient implementation of the method is publicly available, under the GPLv3 license.
Highlights
The arrival of high throughput DNA sequencing technology has created a deluge of biological data [1]
We propose a new algorithm (Jarvis) that uses a competitive prediction based on two different classes: Weighted context models and Weighted stochastic repeat models
We describe in detail the Weighted context models, the Weighted stochastic repeat models, the competitive prediction model, and the implementation of the algorithm
Summary
The arrival of high throughput DNA sequencing technology has created a deluge of biological data [1]. There are many file formats to represent genomic data—for example, FASTA, FASTQ, BAM/SAM, VCF/BCF, and MSA, and many data compressors to represent these formats [8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]. All of these file formats have in common the genomic sequence part, in different phases or using different representations. The ultimate aim of genomics, before downstream analysis, is to assemble high-quality genomic sequences, allowing for having high-quality analysis and consistent scientific findings
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.