Abstract

Modern high-throughput sequencing technologies generate DNA sequences at an ever-increasing rate. While the experimental time and cost necessary to produce DNA sequences keep decreasing, the computational requirements for analyzing and storing the sequences are steeply increasing. Compression is a key technology for dealing with this challenge. Recently, referential compression schemes, which store only the differences between a to-be-compressed input and a known reference sequence, have gained a lot of interest in this field. However, the memory requirements of current algorithms are high and their run times are often slow. In this paper, we propose an adaptive, parallel and highly efficient referential sequence compression method that allows fine-tuning of the trade-off between required memory and compression speed. When using 12 MB of memory, our method is on par with the best previous algorithms for human genomes in terms of compression ratio (400:1) and compression speed. In contrast, it compresses a complete human genome in just 11 seconds when provided with 9 GB of main memory, which is almost three times faster than the best competitor while using less main memory.

Highlights

  • The development of novel high-throughput DNA sequencing techniques has led to an ever-increasing flood of data

  • In many projects only genomes from one species are considered. This means that projects often deal with hundreds of highly similar genomes; for instance, two randomly selected human genomes are estimated to be 99.9% identical. This observation is exploited by so-called referential compression schemes, which only encode the differences of an input sequence with respect to a pre-selected reference sequence (a minimal sketch of this idea is given after this list)

  • Since the project contains slightly more than 1000 genomes, we have only extracted the first 1000 genomes named in these Variant Call Format (VCF) files
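The sketch below is meant only to make the referential idea concrete: it encodes an input sequence as (reference position, match length) entries plus literal characters for the few positions that differ. It is a generic, simplified illustration rather than the method proposed in the paper, and the k-mer dictionary is merely a stand-in for the compressed full-text index (such as a compressed suffix tree) that real implementations use.

    # Minimal, illustrative referential compression (not the paper's algorithm):
    # the input is stored as (reference_position, match_length) entries plus
    # literal characters for mismatching positions.

    def build_kmer_index(reference, k=16):
        """Map every k-mer of the reference to its start positions.
        (Stand-in for a compressed full-text index such as a suffix tree.)"""
        index = {}
        for i in range(len(reference) - k + 1):
            index.setdefault(reference[i:i + k], []).append(i)
        return index

    def compress(sequence, reference, k=16):
        """Greedy left-to-right encoding of `sequence` against `reference`."""
        index = build_kmer_index(reference, k)
        entries, i = [], 0
        while i < len(sequence):
            best_pos, best_len = -1, 0
            for pos in index.get(sequence[i:i + k], []):
                length = k
                while (i + length < len(sequence) and pos + length < len(reference)
                       and sequence[i + length] == reference[pos + length]):
                    length += 1
                if length > best_len:
                    best_pos, best_len = pos, length
            if best_len >= k:
                entries.append((best_pos, best_len))   # match in the reference
                i += best_len
            else:
                entries.append(sequence[i])            # literal mismatch character
                i += 1
        return entries

    def decompress(entries, reference):
        """Rebuild the original sequence from the encoded entries."""
        parts = []
        for e in entries:
            if isinstance(e, tuple):
                pos, length = e
                parts.append(reference[pos:pos + length])
            else:
                parts.append(e)
        return "".join(parts)

    # Toy usage: a donor sequence differing from the reference in one position
    # compresses to a handful of entries instead of ~1,000 characters.
    reference = "ACGTTGCAATGCCGTA" * 64
    donor = reference[:500] + "G" + reference[501:]
    encoded = compress(donor, reference)
    assert decompress(encoded, reference) == donor
    print(len(donor), "characters ->", len(encoded), "entries")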


Summary

Background

The development of novel high-throughput DNA sequencing techniques has led to an ever-increasing flood of data. In many projects only genomes from one species are considered, which means that projects often deal with hundreds of highly similar genomes; for instance, two randomly selected human genomes are estimated to be 99.9% identical. This observation is exploited by so-called referential compression schemes, which only encode the differences of an input sequence with respect to a pre-selected reference sequence.

When provided with 9 GB of main memory, our method performs up to three times faster than the best competitor while still needing less main memory. Both variants, the low-memory and the high-memory configuration, achieve similar compression rates of approximately 400:1 for human DNA.

The memory consumption during index creation is limited as follows: at each step of the index generation we hold one raw reference block of size at most BS bytes in main memory, plus roughly 4 × BS bytes for its compressed suffix tree. The input string is traversed from left to right and, depending on the current characters in the input and in the reference block, different subroutines are executed. A small sketch of the resulting per-block memory budget is given below.
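The per-block bound stated above can be made concrete with a small estimate. The sketch below simply evaluates BS + 4 × BS for a chosen block size; the factor of four for the compressed suffix tree is taken from the description above, while the concrete block size in the example is only an assumed value for illustration.

    # Rough estimate of peak main memory while indexing one reference block:
    # one raw block of at most BS bytes plus roughly 4 * BS bytes for its
    # compressed suffix tree, i.e. about 5 * BS in total.

    def peak_index_memory_bytes(block_size_bs):
        raw_block = block_size_bs                     # the raw reference block
        compressed_suffix_tree = 4 * block_size_bs    # ~4 * BS for its index
        return raw_block + compressed_suffix_tree

    # Example with an assumed block size of 256 MiB: roughly 1.25 GiB peak,
    # which illustrates how the block size tunes the memory/speed trade-off.
    print(peak_index_memory_bytes(256 * 2**20) / 2**30, "GiB")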

(Algorithm listing omitted; it includes a FIND-MATCH subroutine invoked during the left-to-right traversal described above.)
