Abstract

BackgroundThe dramatic development of DNA sequencing technology is generating real big data, craving for more storage and bandwidth. To speed up data sharing and bring data to computing resource faster and cheaper, it is necessary to develop a compression tool than can support efficient compression and transmission of sequencing data onto the cloud storage.ResultsThis paper presents GTZ, a compression and transmission tool, optimized for FASTQ files. As a reference-free lossless FASTQ compressor, GTZ treats different lines of FASTQ separately, utilizes adaptive context modelling to estimate their characteristic probabilities, and compresses data blocks with arithmetic coding. GTZ can also be used to compress multiple files or directories at once. Furthermore, as a tool to be used in the cloud computing era, it is capable of saving compressed data locally or transmitting data directly into cloud by choice. We evaluated the performance of GTZ on some diverse FASTQ benchmarks. Results show that in most cases, it outperforms many other tools in terms of the compression ratio, speed and stability.ConclusionsGTZ is a tool that enables efficient lossless FASTQ data compression and simultaneous data transmission onto to cloud. It emerges as a useful tool for NGS data storage and transmission in the cloud environment. GTZ is freely available online at: https://github.com/Genetalks/gtz.

Highlights

  • The dramatic development of DNA sequencing technology is generating real big data, craving for more storage and bandwidth

  • We present a tool GTZ, it is characterized as a lossless and efficient compression tool to be used jointly with cloud computing for large-scale genomic data analyses: 1. GTZ exploits context model technology combined with multiple prediction modelling schemes

  • A typical FASTQ file contains four lines per sequence: Line 1 begins with a character ‘@’ followed by a sequence identifier; Line 2 holds the raw sequence composed of A, C, T, and G; line 3 begins with a character ‘+’ and is optionally followed by the same sequence identifier again; line 4 holds the corresponding quality scores in ASCII characters for the sequence characters in line 2

Read more

Summary

Introduction

The dramatic development of DNA sequencing technology is generating real big data, craving for more storage and bandwidth. To speed up data sharing and bring data to computing resource faster and cheaper, it is necessary to develop a compression tool than can support efficient compression and transmission of sequencing data onto the cloud storage. General-propose compression tools, such as gzip (http://www.gzip.org/), bzip (http://www.bzip.org/) and 7z (www.7-zip.org), have been utilized to compress NGS data. These tools do not take advantage of the. Fqzcomp [6] estimates character probabilities by order-k context modelling and compresses NGS data in FASTQ format with the help of arithmetic coders

Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.