A New Lossless DNA Compression Algorithm Based on A Single-Block Encoding Scheme

Deloula Mansouri,Abdeldjalil Saidani,Xiaohui Yuan

doi:10.3390/a13040099


Open Access

https://doi.org/10.3390/a13040099


Journal: Algorithms	Publication Date: Apr 20, 2020
Citations: 9	License type: CC BY 4.0

Affiliation: Wuhan University of Technology

Abstract

Advances in DNA sequencing technology produce a massive amount of genomic data every day, mainly DNA sequences, which demands ever more storage and bandwidth. Managing, analyzing and, in particular, storing these large amounts of data has become a major scientific challenge for bioinformatics, and compression has become necessary to address it. In this paper, we describe a new reference-free DNA compressor abbreviated as DNAC-SBE. DNAC-SBE is a lossless hybrid compressor that consists of three phases. First, starting from the most frequent base (Bi), the positions of each Bi are replaced with ones and the positions of the other bases that have smaller frequencies than Bi are replaced with zeros. Second, to encode the generated streams, we propose a new single-block encoding scheme (SBE) that exploits the positions of neighboring bits within the block using two different techniques. Finally, the proposed algorithm dynamically assigns the shorter code to each block. Results show that DNAC-SBE outperforms state-of-the-art compressors and proves its efficiency in terms of special conditions imposed on compressed data, storage space and data transfer rate regardless of the file format or the size of the data.
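The first phase described in the abstract can be illustrated with a minimal sketch. It reflects one reading of that sentence, not the authors' implementation: bases are processed in descending order of frequency, each base gets a bit stream with ones at its own positions and zeros at the positions of the less frequent bases, and positions of bases already processed are dropped from later streams. The function names, the alphabetical tie-breaking rule and the decision to return the sequence length are assumptions made for this example.

```python
from collections import Counter


def to_binary_streams(seq):
    """Split a sequence into one bit stream per base (hypothetical helper).

    Returns (order, streams, length). `order` lists the distinct symbols by
    descending frequency; `streams[i]` marks the positions of `order[i]`
    among the symbols that are less frequent or equal to it in rank.
    The rarest symbol needs no stream of its own.
    """
    counts = Counter(seq)
    # Ties are broken alphabetically (an assumption) so the output is deterministic.
    order = sorted(counts, key=lambda b: (-counts[b], b))
    streams = []
    remaining = seq
    for base in order[:-1]:
        streams.append("".join("1" if c == base else "0" for c in remaining))
        # Drop the base just encoded; later streams only cover rarer symbols.
        remaining = "".join(c for c in remaining if c != base)
    return order, streams, len(seq)


def from_binary_streams(order, streams, length):
    """Invert to_binary_streams, showing that the mapping is lossless."""
    if not order:
        return ""
    # Zeros in the last stream can only stand for the rarest symbol.
    tail_len = streams[-1].count("0") if streams else length
    current = order[-1] * tail_len
    for base, stream in zip(reversed(order[:-1]), reversed(streams)):
        rest = iter(current)
        current = "".join(base if bit == "1" else next(rest) for bit in stream)
    return current


order, streams, n = to_binary_streams("ACGTAACGA")
print(order)    # ['A', 'C', 'G', 'T']
print(streams)  # ['100011001', '10010', '101']
print(from_binary_streams(order, streams, n))  # ACGTAACGA
```

Because the sketch works on whatever symbols appear in the input, it also handles characters outside A, C, G and T. The later phases (single-block encoding and per-block code selection) are not covered here, since the abstract does not specify the two techniques involved.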

Highlights

Bioinformatics combines informatics and biology: computational tools and approaches are applied to solve the problems that biologists face in different domains, such as agriculture, medical science and the study of the living world
The human genome contains around three billion characters in 23 pairs of chromosomes, a gram of soil contains 40 million bacterial cells, and certain amphibian species can have more than 100 billion nucleotides [3,6]
The authors described a new reference-free DNA compressor (DNAC-SBE), a lossless hybrid compressor that consists of three phases, to compress any type of DNA sequence

Summary

Introduction

Bioinformatics combines informatics and biology: computational tools and approaches are applied to solve the problems that biologists face in different domains, such as agriculture, medical science and the study of the living world. MFCompress [8] is one of the most efficient lossless non-referential compression algorithms for compacting FASTA files, according to a recent survey [13]. It divides the data into two separate streams: one containing the nucleotide sequences, the other the headers of the FASTA records. In [10], the authors present a new FASTQ compression algorithm named DSRC (DNA sequence reads compressor). It imposes a hierarchical structure on the compressed data by dividing the data into blocks and superblocks, and it encodes the superblocks independently to provide fast random access to any record. DNAC-SBE is a reference-free method; it does not depend on any specific reference genome or any patterns and may work with any type of DNA sequence, including sequences with non-ACGT characters. It encodes each block of data separately and does not require any additional information in the decoding step.
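The separation of FASTA data into a header stream and a sequence stream, as attributed to MFCompress above, can be sketched in a few lines. This is an illustration of the general idea only, not MFCompress code; the function name `split_fasta` is hypothetical, and the sketch discards line-length information, so on its own it does not give a byte-exact round trip.

```python
def split_fasta(text):
    """Separate FASTA text into parallel lists of headers and sequences.

    Hypothetical helper for illustration. Sequence lines of one record are
    joined into a single string; lines appearing before the first header
    are ignored.
    """
    headers = []
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            headers.append(line[1:])  # header text without the '>' marker
            chunks.append([])
        elif chunks:
            chunks[-1].append(line)
    sequences = ["".join(parts) for parts in chunks]
    return headers, sequences


example = ">seq1 demo\nACGT\nAC\n>seq2\nNNGT\n"
headers, sequences = split_fasta(example)
print(headers)    # ['seq1 demo', 'seq2']
print(sequences)  # ['ACGTAC', 'NNGT']
```

Keeping the two streams apart lets each be compressed with a model suited to it: header text is free-form, while the sequence stream draws on a very small alphabet.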



