Abstract
The success of high-throughput sequencing has led to an increasing number of projects which sequence large populations of a species. Storage and analysis of sequence data is a key challenge in these projects because of the sheer size of the datasets. Compression is one simple technology to deal with this challenge. Referential factorization and compression schemes, which store only the differences between an input sequence and a reference sequence, have gained much interest in this field. Highly similar sequences, e.g., human genomes, can be compressed with a compression ratio of 1,000:1 and more, up to two orders of magnitude better than with standard compression techniques. Recently, it was shown that compression against multiple references from the same species can boost the compression ratio up to 4,000:1. However, a detailed analysis of using multiple references is lacking, e.g., with respect to main memory consumption and optimality. In this paper, we describe one key technique for referential compression against multiple references: the factorization of sequences. Based on the notion of an optimal factorization, we propose optimization heuristics and identify parameter settings which greatly influence (1) the size of the factorization, (2) the time for factorization, and (3) the required amount of main memory. We evaluate a total of 30 setups with a varying number of references on data from three different species. Our results show a wide range of factorization sizes (from optimal to an overhead of up to 300%), factorization speeds (0.01 MB/s to more than 600 MB/s), and main memory usage (a few dozen MB to dozens of GB). Based on our evaluation, we identify the best configurations for common use cases. Our evaluation shows that multi-reference factorization is much better than single-reference factorization.
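To make the core idea of referential factorization concrete, the following is a minimal greedy sketch in Python. It is an illustration under assumptions of our own, not the algorithm or index evaluated in the paper: the function name referential_factorize, the k-mer seed index, and the parameter k are hypothetical, and the factor format (reference position plus match length, or a single literal base) is deliberately simplified.

```python
from typing import List, Tuple, Union

# A factor is either (reference position, match length) or a single literal base.
Factor = Union[Tuple[int, int], str]

def referential_factorize(sequence: str, reference: str, k: int = 16) -> List[Factor]:
    """Greedy left-to-right factorization of `sequence` against `reference` (illustrative sketch)."""
    # Index every k-mer of the reference by its starting positions.
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)

    factors: List[Factor] = []
    pos = 0
    while pos < len(sequence):
        best_start, best_len = -1, 0
        # Look up the k-mer starting at the current position and extend each hit.
        for ref_pos in index.get(sequence[pos:pos + k], []):
            length = k
            while (pos + length < len(sequence)
                   and ref_pos + length < len(reference)
                   and sequence[pos + length] == reference[ref_pos + length]):
                length += 1
            if length > best_len:
                best_start, best_len = ref_pos, length
        if best_len >= k:
            factors.append((best_start, best_len))
            pos += best_len
        else:
            factors.append(sequence[pos])  # no reference match: store the base literally
            pos += 1
    return factors
```

For example, with k=4, factorizing "ACGTACGTTT" against the reference "ACGTACGTAC" yields [(0, 8), 'T', 'T']: one long reference match followed by two literal bases.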
Highlights
The development of novel high-throughput DNA sequencing techniques has led to an exponentially increasing flood of data
Open questions include: How much gain does one obtain when computing a factorization against multiple references? How close to optimal are approximate factorization algorithms, which need considerably less main memory? Which technique/index should be applied to compute a multi-reference factorization under resource constraints? With this paper we provide an in-depth analysis of multi-reference factorization techniques (a minimal multi-reference sketch follows these highlights)
Our results show a wide range of factorization sizes, factorization speeds (0.01 MB/s to more than 600 MB/s), and main memory usage
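As a companion to the single-reference sketch above, one simple way to factorize against multiple references is to build a shared k-mer index over all references and let each factor record which reference it matched. This is again an illustrative assumption rather than the indexing strategy evaluated in the paper; the names multi_reference_factorize and MultiFactor are hypothetical.

```python
from typing import Dict, List, Tuple, Union

# A factor is either (reference id, reference position, match length) or a single literal base.
MultiFactor = Union[Tuple[int, int, int], str]

def multi_reference_factorize(sequence: str, references: List[str], k: int = 16) -> List[MultiFactor]:
    """Greedy factorization against several references at once (illustrative sketch)."""
    # One shared k-mer index over all references; each hit remembers its reference id.
    index: Dict[str, List[Tuple[int, int]]] = {}
    for ref_id, ref in enumerate(references):
        for i in range(len(ref) - k + 1):
            index.setdefault(ref[i:i + k], []).append((ref_id, i))

    factors: List[MultiFactor] = []
    pos = 0
    while pos < len(sequence):
        best = None  # (ref_id, ref_pos, length) of the longest extension found so far
        for ref_id, ref_pos in index.get(sequence[pos:pos + k], []):
            ref = references[ref_id]
            length = k
            while (pos + length < len(sequence)
                   and ref_pos + length < len(ref)
                   and sequence[pos + length] == ref[ref_pos + length]):
                length += 1
            if best is None or length > best[2]:
                best = (ref_id, ref_pos, length)
        if best is not None:
            factors.append(best)
            pos += best[2]
        else:
            factors.append(sequence[pos])  # unmatched base stored literally
            pos += 1
    return factors
```

Whether all references share one index or each gets its own is exactly the kind of trade-off between factorization size, speed, and main memory that the paper evaluates; the shared index here is merely the simplest choice for illustration.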
Summary
The development of novel high-throughput DNA sequencing techniques has led to an exponentially increasing flood of data. Decreasing costs make it possible to sequence large samples of a given population. Examples of such projects are the 1000-Genomes project [1], the international cancer sequencing consortium [2], the UK10K project [3], and the Million Veteran Program [4]. These large-scale projects are generating comprehensive surveys of the genomic landscape of phenotypes (or diseases) by sequencing thousands of genomes [5]. Sequence compression is a key technology to cope with the increasing flood of DNA sequences [7].