High-speed and high-ratio referential genome compression.

Yuansheng Liu, Jinyan Li, Hui Peng, Limsoon Wong

doi:10.1093/bioinformatics/btx412


Open Access


Abstract

The rapidly increasing number of genomes generated by high-throughput sequencing platforms and assembly algorithms brings problems in data storage, compression and communication. Traditional compression algorithms cannot deliver the high compression ratios required, because DNA sequences have intrinsically challenging features such as a small alphabet, frequent repeats and palindromes. Reference-based lossless compression, in which only the differences between two similar genomes are stored, is a promising approach with a high compression ratio. We present a high-performance referential genome compression algorithm named HiRGC. It is based on a 2-bit encoding scheme and an advanced greedy-matching search on a hash table. We compare the performance of HiRGC with four state-of-the-art compression methods on a benchmark dataset of eight human genomes. HiRGC takes <30 min to compress about 21 gigabytes of each set of the seven target genomes into 96-260 megabytes, achieving compression ratios from 217 down to 82 times. This performance is at least 1.9 times better than the best competing algorithm on its best case. Our compression speed is also at least 2.9 times faster. HiRGC is stable and robust across different reference genomes. In contrast, the competing methods' performance varies widely with the choice of reference genome. Further experiments on 100 human genomes from the 1000 Genomes Project and on genomes of several other species again demonstrate that HiRGC's performance is consistently excellent. The C++ and Java source code of our algorithm is freely available for academic and non-commercial use and can be downloaded from https://github.com/yuansliu/HiRGC. Contact: jinyan.li@uts.edu.au. Supplementary data are available at Bioinformatics online.
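The two ingredients named in the abstract, a 2-bit base encoding and a greedy match search over a hash table of the reference, can be illustrated with a toy sketch. This is not the HiRGC source code. The function names, the base-to-bits mapping, the k-mer length `k=4`, and the `('M', ...)` / `('L', ...)` output records are all assumptions made for illustration. The sketch handles only the uppercase bases A, C, G and T.

```python
# Illustrative sketch only; NOT the HiRGC implementation.
# All names, the bit mapping, and k are assumptions for this example.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per base (assumed mapping)


def kmer_hash(seq, start, k):
    """Pack k bases starting at `start` into one integer, 2 bits per base."""
    value = 0
    for base in seq[start:start + k]:
        value = (value << 2) | CODE[base]
    return value


def build_index(reference, k):
    """Map each packed k-mer to the list of its start positions in the reference."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(kmer_hash(reference, i, k), []).append(i)
    return index


def compress(target, reference, k=4):
    """Greedy matching: emit ('M', ref_pos, length) for matches, ('L', bases) otherwise."""
    index = build_index(reference, k)
    ops, literal, i = [], [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        if i + k <= len(target):
            # The packing is injective for fixed k, so a hash hit is a true k-mer match.
            for pos in index.get(kmer_hash(target, i, k), []):
                length = k
                while (i + length < len(target)
                       and pos + length < len(reference)
                       and target[i + length] == reference[pos + length]):
                    length += 1
                if length > best_len:
                    best_pos, best_len = pos, length
        if best_len >= k:
            if literal:  # flush pending unmatched bases first
                ops.append(("L", "".join(literal)))
                literal = []
            ops.append(("M", best_pos, best_len))
            i += best_len
        else:
            literal.append(target[i])
            i += 1
    if literal:
        ops.append(("L", "".join(literal)))
    return ops


def decompress(ops, reference):
    """Rebuild the target from match and literal records."""
    out = []
    for op in ops:
        out.append(reference[op[1]:op[1] + op[2]] if op[0] == "M" else op[1])
    return "".join(out)
```

For example, calling `compress("ACGTACGATTGACCA", "ACGTACGTTTGACCA")` stores the target as two long matches against the reference plus one literal base, and `decompress` restores the target exactly. A real tool would additionally need to handle lowercase runs, `N` bases and line breaks in FASTA files, and would entropy-code the resulting records.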

Journal: Bioinformatics
Publication Date: Jun 23, 2017
Citations: 31

Similar Papers

Modern lossless compression techniques: Review, comparison and analysis
Apoorv Gupta ... Aman Bansal
01 Feb 2017

A Fast Fractal Based Compression for MRI Images
Shuai Liu ... Nianyin Zeng
IEEE Access | VOL. 7
01 Jan 2019

MLC: An Efficient Multi-level Log Compression Method for Cloud Backup Systems
Bo Feng ... Jie Li
01 Aug 2016

LCQS: an efficient lossless compression tool of quality scores with random access functionality
Jiabing Fu ... Shoubin Dong
BMC Bioinformatics | VOL. 21
18 Mar 2020
