Abstract

The development of high-throughput sequencing technology has generated huge amounts of DNA data. Many general-purpose compression algorithms, such as LZ77, are not ideal for compressing DNA data. Building on Nour and Sharawi's method, we propose a new lossless, reference-free method that improves compression performance. The original sequences are converted into eight intermediate files and then six final files, and the LZ77 algorithm is used to compress the six final files. The results show that, compared with Nour and Sharawi's method (the fastest method to date), compression time decreases by 83% and decompression time by 54% on average, while the compression ratio remains almost the same. Moreover, our method has a wider range of application than Nour and Sharawi's method. Compared with some state-of-the-art compression tools, such as XM and FCM-Mx, our method reduces compression time by more than 90% on average.
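As a rough illustration of the pipeline summarized above, the sketch below compresses and restores a set of final files with an LZ77-family codec. This is a minimal sketch, not the authors' implementation: the file names (f1aa, f1bb, f2aa, f2bb, f3, f0) are taken from the paper, zlib's DEFLATE (an LZ77-based codec) stands in for a raw LZ77 coder, and the conversion that produces the eight intermediate and six final files is not specified in this excerpt, so it is omitted.

```python
import zlib
from pathlib import Path

# Names of the six final files, as given in the paper.
FINAL_FILES = ["f1aa", "f1bb", "f2aa", "f2bb", "f3", "f0"]

def compress_final_files(directory: str) -> None:
    """Compress each of the six final files with an LZ77-family codec.

    zlib's DEFLATE (LZ77 + Huffman) is a stand-in for the raw LZ77
    coder used in the paper; the preprocessing that derives the eight
    intermediate and six final files is not shown here.
    """
    for name in FINAL_FILES:
        path = Path(directory) / name
        path.with_suffix(".lz").write_bytes(zlib.compress(path.read_bytes(), 9))

def decompress_final_files(directory: str) -> None:
    """Inverse step: restore each final file from its compressed form."""
    for name in FINAL_FILES:
        path = Path(directory) / name
        path.write_bytes(zlib.decompress(path.with_suffix(".lz").read_bytes()))
```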

Highlights

  • The advent of high-throughput sequencing technology has led to a dramatic increase in the size of DNA data

  • We selected ten genomes of different lengths (1–15 M) from the NCBI (National Center for Biotechnology Information) database as the test data set; all data were downloaded by the authors from http://www.ncbi.nlm.nih.gov. We evaluated compression ratio, compression time, and decompression time

  • As the amount of DNA data continues to grow, we believe that the LZ77 algorithm will play a key role in DNA data compression due to its simplicity and applicability

Introduction

The advent of high-throughput sequencing technology has led to a dramatic increase in the size of DNA data. The GeNML algorithm proposed by Tabus and Korodi [6] uses a special normalized maximum likelihood discrete regression model [7] (NMLComp, characterized by low spatial complexity) as a key part of the algorithm; it encodes the data in blocks and combines this with the alternative principle to compress DNA sequences.

In the first step of the first compression phase, the first 1,000 base characters are taken from the original sequence as a sample; the frequencies of the four base characters A, T, G, and C are calculated and sorted in descending order. If the first step is applied, the binary lengths obtained in the three files are 30 bits (nine 1s and twenty-one 0s), 21 bits (eight 1s and thirteen 0s), and 13 bits (seven 1s and six 0s), so 64 bits are needed to store 30 base characters. The LZ77 algorithm is then used to decompress the six files (f1aa, f1bb, f2aa, f2bb, f3, and f0).
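A minimal sketch of the sampling step described above, assuming plain-text input with one base character per position; the function name and the Counter-based counting are illustrative, not from the paper:

```python
from collections import Counter

def sample_base_order(sequence: str, sample_size: int = 1000) -> list[tuple[str, int]]:
    """Take the first `sample_size` base characters as a sample, count the
    frequencies of A, T, G, and C, and return (base, count) pairs sorted
    in descending order of frequency."""
    sample = sequence[:sample_size].upper()
    counts = Counter(c for c in sample if c in "ATGC")
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Example on a toy sequence (hypothetical input, for illustration only).
print(sample_base_order("ATTGGGCCCCA" * 100))

# Sanity check of the bit-length example above: the three files take
# 30 + 21 + 13 bits in total, i.e. 64 bits for 30 base characters.
assert 30 + 21 + 13 == 64
```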
