Abstract

Next-generation sequencing (NGS) technologies permit the rapid production of vast amounts of data at low cost. Economical data storage and transmission have therefore become increasingly important challenges for NGS experiments. In this paper, we introduce a new non-reference-based read sequence compression tool called SRComp. It first sorts read sequences in lexicographical order using a fast string-sorting algorithm called burstsort, and then encodes the sorted read sequences using Elias omega-based integer coding. SRComp has been benchmarked on four large NGS datasets. The experimental results show that it can run 5–35 times faster than current state-of-the-art read sequence compression tools such as BEETL and SCALCE, while retaining comparable compression efficiency for large collections of short read sequences. SRComp is therefore particularly valuable in applications where compression time is a major concern.
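
The two stages named above can be illustrated with a minimal Python sketch. It is not SRComp's implementation: Python's built-in `sorted` stands in for burstsort, and the abstract does not say how SRComp maps sorted reads to integers, so the shared-prefix lengths used here are a purely hypothetical choice made for illustration. Only the Elias omega codec itself follows the standard definition of that code.

```python
from typing import List, Tuple


def elias_omega_encode(n: int) -> str:
    """Return the Elias omega code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Elias omega coding is defined for integers >= 1")
    code = "0"  # terminating zero bit
    while n > 1:
        group = bin(n)[2:]      # binary form of n, leading bit is always 1
        code = group + code     # prepend the group
        n = len(group) - 1      # next, encode the group length minus one
    return code


def elias_omega_decode(bits: str, pos: int = 0) -> Tuple[int, int]:
    """Decode one integer starting at pos; return (value, next position)."""
    n = 1
    while bits[pos] == "1":
        length = n + 1                       # the next group has n + 1 bits
        n = int(bits[pos:pos + length], 2)
        pos += length
    return n, pos + 1                        # skip the terminating zero


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b have in common."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return i


# Hypothetical toy reads; sorted() is a stand-in for burstsort.
reads: List[str] = ["GATTACA", "ACGTACG", "ACGTTTT", "GATTCCC"]
sorted_reads = sorted(reads)

# Assumption for illustration only: describe each read by the length of the
# prefix it shares with its predecessor in sorted order.
prefix_lengths = [0] + [
    shared_prefix_len(prev, cur)
    for prev, cur in zip(sorted_reads, sorted_reads[1:])
]

# Elias omega cannot encode 0, so shift every value by one before encoding.
stream = "".join(elias_omega_encode(p + 1) for p in prefix_lengths)

# Decode the bit stream back into the original prefix lengths.
decoded: List[int] = []
pos = 0
while pos < len(stream):
    value, pos = elias_omega_decode(stream, pos)
    decoded.append(value - 1)
```

Sorting places similar reads next to each other, so values derived from neighbouring reads tend to be small, and Elias omega assigns short codes to small integers without needing a fixed upper bound.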

Highlights

  • Next-generation sequencing (NGS) technologies are gradually replacing Sanger sequencing as the dominant sequencing technologies and are yielding a revolutionary impact on genetics and biomedical research

  • To evaluate the performance of SRComp in compressing short read sequences, we carried out comparative experiments on four different datasets downloaded from the DDBJ Sequence Read Archive (SRA)

  • The first two datasets, as well as one run from the fourth dataset (i.e., SRR027520), were already employed in several previous studies to test a variety of read sequence compression tools [2,3,10,13]


Summary

Introduction

Next-generation sequencing (NGS) technologies are gradually replacing Sanger sequencing as the dominant sequencing technologies and are having a revolutionary impact on genetics and biomedical research. These technologies can rapidly sequence DNA on the gigabase scale in a single run, generating hundreds or even thousands of gigabases in just a few days. Although read DNA sequences are stored together with their associated quality scores in FASTQ files, the two are usually processed separately and compressed using different approaches. A reference-based approach is often used to compress read DNA sequences. This approach first aligns reads to a known reference genome sequence and then encodes each read compactly as a genomic position plus any differences from the aligned reference [5,7,8,11,12]. However, reference-based compressed data are at high risk of becoming inaccessible once the reference genome sequence used for compression is lost [13].
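
The reference-based idea described above can be sketched in a few lines of Python. This is a simplified illustration rather than any of the cited tools: the alignment position is assumed to be already known, only substitutions are handled (no insertions or deletions), and the names `encode_read` and `decode_read` are hypothetical.

```python
from typing import List, Tuple

# A record is (alignment position, read length, list of (offset, base) mismatches).
Record = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, read: str, pos: int) -> Record:
    """Encode a read as its position on the reference plus its mismatches."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit on the reference at this position")
    diffs = [
        (i, read_base)
        for i, (ref_base, read_base) in enumerate(zip(window, read))
        if ref_base != read_base
    ]
    return pos, len(read), diffs


def decode_read(reference: str, record: Record) -> str:
    """Rebuild a read; this is impossible without the original reference."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


# Hypothetical toy data: the read differs from the reference at one base.
reference = "ACGTACGTTAGC"
read = "ACGATAGC"
record = encode_read(reference, read, pos=4)
restored = decode_read(reference, record)
```

Because `decode_read` takes the reference as an argument, the sketch also makes the stated risk concrete: the stored record alone does not contain enough information to recover the read.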
