Abstract

With the rapid evolution of DNA sequencing technology, a massive amount of genomic data, mainly DNA sequences, is produced every day, demanding ever more storage and bandwidth. Managing, analyzing and, in particular, storing these large amounts of data has become a major scientific challenge for bioinformatics, and compression has become necessary to overcome it. In this paper, we describe a new reference-free DNA compressor abbreviated as DNAC-SBE. DNAC-SBE is a lossless hybrid compressor that consists of three phases. First, starting from the most frequent base (Bi), the positions of each Bi are replaced with ones and the positions of the other bases that have smaller frequencies than Bi are replaced with zeros. Second, to encode the generated streams, we propose a new single-block encoding scheme (SBE) that exploits the positions of neighboring bits within the block using two different techniques. Finally, the proposed algorithm dynamically assigns the shorter-length code to each block. Results show that DNAC-SBE outperforms state-of-the-art compressors and proves its efficiency in terms of special conditions imposed on compressed data, storage space and data transfer rate, regardless of the file format or the size of the data.
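The first phase can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `binary_streams` and `rebuild` are hypothetical, ties in frequency are broken alphabetically as an arbitrary choice, and the sketch assumes that each stream covers only the positions not already claimed by a more frequent base, which is one reading of the description above.

```python
from collections import Counter


def binary_streams(seq):
    """Turn a DNA string into one bit stream per base, most frequent first.

    Assumption: a stream for base Bi covers only positions not already
    taken by a more frequent base (1 = Bi, 0 = a less frequent base).
    """
    counts = Counter(seq)
    # Descending frequency; alphabetical tie-break keeps the order deterministic.
    order = sorted(counts, key=lambda b: (-counts[b], b))
    streams = []
    remaining = seq
    for base in order[:-1]:  # the least frequent base is implied by the leftovers
        bits = "".join("1" if c == base else "0" for c in remaining)
        streams.append((base, bits))
        remaining = "".join(c for c in remaining if c != base)
    return order, streams


def rebuild(order, streams, length):
    """Invert binary_streams; `length` is only needed for single-base input."""
    if not order:
        return ""
    tail_len = streams[-1][1].count("0") if streams else length
    current = [order[-1]] * tail_len
    # Re-insert bases from the least frequent stream back to the most frequent.
    for base, bits in reversed(streams):
        leftovers = iter(current)
        current = [base if bit == "1" else next(leftovers) for bit in bits]
    return "".join(current)


seq = "AACGAATA"
order, streams = binary_streams(seq)
print(order)
print(streams)
print(rebuild(order, streams, len(seq)) == seq)
```

The streams produced here are what a block-level encoder such as the SBE scheme would then compress; that second phase is not sketched because the abstract does not specify its two techniques.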

Highlights

  • Bioinformatics is a combination of the fields of informatics and biology, in which computational tools and approaches are applied to solve the problems that biologists face in different domains, such as agriculture, medical science and the study of the living world

  • The human genome contains around three billion characters in 23 pairs of chromosomes, a gram of soil holds 40 million bacterial cells, and certain amphibian species can even have more than 100 billion nucleotides [3,6]

  • The authors described a new reference-free DNA compressor (DNAC-SBE), a lossless hybrid compressor that consists of three phases, to compress any type of DNA sequence

Summary

Introduction

Bioinformatics is a combination of the fields of informatics and biology, in which computational tools and approaches are applied to solve the problems that biologists face in different domains, such as agriculture, medical science and the study of the living world. MFCompress [8] is one of the most efficient lossless non-referential compression algorithms for FASTA files according to a recent survey [13]. It divides the data into two separate streams: one containing the nucleotide sequences, the other the headers of the FASTA records. In [10], the authors present a new FASTQ compression algorithm named DSRC (DNA sequence reads compressor). It imposes a hierarchical structure on the compressed data by dividing the data into blocks and superblocks, and it encodes the superblocks independently to provide fast random access to any record. DNAC-SBE is a reference-free method: it does not depend on any specific reference genome or on any patterns, and it may work with any type of DNA sequence, including sequences with non-ATCG characters. It encodes each block of data separately and does not require any additional information in the decoding step.
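The separation of FASTA data into a header stream and a sequence stream, as described for MFCompress above, can be sketched as follows. This is only an illustration of the idea, not MFCompress's actual code: `split_fasta` is a hypothetical helper, and it assumes that line lengths need not be preserved, so the lines of each record are simply joined.

```python
def split_fasta(text):
    """Separate FASTA text into a list of headers and a list of sequences.

    Assumption: original line wrapping is not kept; each record's
    sequence lines are concatenated into one string.
    """
    headers, parts = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            headers.append(line[1:])  # drop the leading '>' marker
            parts.append([])
        elif parts:
            parts[-1].append(line)  # sequence line of the current record
    return headers, ["".join(p) for p in parts]


example = ">seq1 test\nACGT\nAC\n>seq2\nGG\n"
headers, sequences = split_fasta(example)
print(headers)
print(sequences)
```

Keeping the two streams apart lets each one be modeled by a compressor suited to its statistics, since header text and nucleotide text have very different alphabets.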
