SRComp: Short Read Sequence Compression Using Burstsort and Elias Omega Coding

Jeremy John Selva, Xin Chen, James C Nelson

doi:10.1371/journal.pone.0081414

Abstract

Next-generation sequencing (NGS) technologies permit the rapid production of vast amounts of data at low cost. Economical data storage and transmission hence becomes an increasingly important challenge for NGS experiments. In this paper, we introduce a new non-reference based read sequence compression tool called SRComp. It works by first employing a fast string-sorting algorithm called burstsort to sort read sequences in lexicographical order and then Elias omega-based integer coding to encode the sorted read sequences. SRComp has been benchmarked on four large NGS datasets, where experimental results show that it can run 5–35 times faster than current state-of-the-art read sequence compression tools such as BEETL and SCALCE, while retaining comparable compression efficiency for large collections of short read sequences. SRComp is a read sequence compression tool that is particularly valuable in certain applications where compression time is of major concern.
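As an illustration of the integer-coding step named in the abstract, the following minimal Python sketch implements Elias omega coding for positive integers. It is not SRComp's implementation: the function names are hypothetical, bit strings stand in for packed bits, and the way SRComp maps sorted reads to integers is not described in this section, so that step is left out. Python's built-in `sorted` is used only as a stand-in for burstsort, since both produce lexicographical order.

```python
def elias_omega_encode(n: int) -> str:
    """Return the Elias omega code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Elias omega coding is defined for integers >= 1")
    code = "0"  # terminating bit
    while n > 1:
        group = bin(n)[2:]      # binary representation of n, no prefix
        code = group + code     # prepend the group
        n = len(group) - 1      # next, encode (group length - 1)
    return code


def elias_omega_decode(bits: str, pos: int = 0) -> tuple[int, int]:
    """Decode one integer starting at bits[pos]; return (value, next position)."""
    n = 1
    while bits[pos] == "1":
        width = n + 1                        # the next group has n + 1 bits
        n = int(bits[pos:pos + width], 2)
        pos += width
    return n, pos + 1                        # skip the terminating "0"


if __name__ == "__main__":
    # Stand-in for the sorting step (assumption: built-in sort, not burstsort).
    reads = sorted(["GATTACA", "ACGTACG", "ACGTTTT"])
    print(reads)

    # Small integers get short codes; codes are self-delimiting in a stream.
    stream = "".join(elias_omega_encode(n) for n in [1, 2, 3, 4, 16])
    pos, decoded = 0, []
    while pos < len(stream):
        value, pos = elias_omega_decode(stream, pos)
        decoded.append(value)
    print(decoded)
```

Because each code is self-delimiting, a sequence of integers can be concatenated without separators and still be decoded unambiguously, which is what makes the scheme suitable for streams dominated by small values.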

Highlights

Next-generation sequencing (NGS) technologies are gradually replacing Sanger sequencing as the dominant sequencing technologies and are yielding a revolutionary impact on genetics and biomedical research.
To evaluate the performance of SRComp in compressing short read sequences, we carried out comparative experiments on four different datasets downloaded from the DDBJ Sequence Read Archive (SRA).
The first two datasets, as well as one run from the fourth dataset (i.e., SRR027520), were already employed in several previous studies to test a variety of read sequence compression tools [2,3,10,13].

Summary

Introduction

Next-generation sequencing (NGS) technologies are gradually replacing Sanger sequencing as the dominant sequencing technologies and are having a revolutionary impact on genetics and biomedical research. These technologies can rapidly sequence DNA on the gigabase scale in a single run, generating hundreds or even thousands of gigabases in just a few days. Although read DNA sequences are stored together with their associated quality scores in FASTQ files, the two are usually processed separately and compressed using different approaches. A reference-based approach is often used to compress read DNA sequences. This approach first aligns reads to a known reference genome sequence and then encodes each read compactly as a genomic position plus any differences from the reference at that position [5,7,8,11,12]. However, reference-based compressed data are at high risk of becoming inaccessible once the reference genome sequence used for compression is lost [13].
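The reference-based idea described above can be made concrete with a small Python sketch. It is a simplified illustration rather than any published tool's format: the function names and record layout are hypothetical, the alignment position is assumed to be already known, and only substitutions are handled (no insertions or deletions).

```python
def encode_read(reference: str, read: str, position: int):
    """Encode a read as (position, length, substitutions) against a reference.

    Assumes the read aligns at `position` with substitutions only.
    """
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    # Keep only the offsets where the read differs from the reference.
    diffs = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return position, len(read), diffs


def decode_read(reference: str, record) -> str:
    """Rebuild the read; this is impossible without the same reference."""
    position, length, diffs = record
    bases = list(reference[position:position + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"   # toy reference sequence (made up)
    read = "TACCTT"              # differs from reference[3:9] at one base
    record = encode_read(reference, read, 3)
    print(record)
    print(decode_read(reference, record))
```

The record stores no bases except the mismatches, so decoding depends entirely on having the identical reference sequence, which is the accessibility risk noted above.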

Journal: PLoS ONE	Publication Date: Dec 13, 2013
Citations: 22	License type: CC BY 4.0
