Abstract

The memory and network traffic required to store and move newly sequenced biological data have grown sharply in recent years. Genomic projects such as HapMap and 1000 Genomes have contributed to the rapid growth of databases and network traffic related to genomic data and to the development of new, efficient technologies. Large-scale sequencing of DNA samples has attracted new attention and produced new research, and interest in genomic data within the scientific community has greatly increased. In a very short time, researchers have developed hardware tools, analysis software, algorithms, private databases, and infrastructures to support research in genomics. In this paper, we analyze different approaches for compressing digital files that are generated by Next-Generation Sequencing tools and contain nucleotide sequences. We discuss and evaluate the compression performance of generic compression algorithms by comparing them with Quip, a system designed by Jones et al. specifically for genomic file compression. Moreover, we present a simple but effective technique for the compression of DNA sequences in which we consider only the relevant DNA data, and we experimentally evaluate its performance.

Highlights

  • Next-Generation Sequencing technologies (NGS for short) have enabled DNA sequencing at surprisingly high speed and low cost

  • We analyze different approaches for compressing FASTQ, Sequence Alignment/Map (SAM), and binary SAM (BAM) files generated by Next-Generation Sequencing tools containing nucleotide sequences, extending and improving the research we presented in [5]

  • FASTQ, SAM, and BAM files can be compressed by using generic lossless compression tools
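
To make the last highlight concrete, the following is a minimal sketch, not the authors' experimental setup and not Quip, of how a FASTQ file could be passed through generic lossless compressors from the Python standard library. The synthetic reads, the function names (`make_fastq`, `sequence_lines`, `compressed_sizes`), and the choice of gzip, bz2, and lzma are assumptions made for illustration. The helper `sequence_lines` also shows one plausible reading of "considering only the relevant DNA data": keeping just the nucleotide line of each four-line FASTQ record.

```python
import bz2
import gzip
import lzma
import random
from typing import Dict, List


def make_fastq(n_reads: int = 200, read_len: int = 50, seed: int = 0) -> str:
    """Build a small synthetic FASTQ text (hypothetical data, illustration only)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FGHI") for _ in range(read_len))
        # A FASTQ record has four lines: header, sequence, separator, qualities.
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records)


def sequence_lines(fastq_text: str) -> List[str]:
    """Return only the nucleotide line (second line of each 4-line record)."""
    return fastq_text.splitlines()[1::4]


def compressed_sizes(data: bytes) -> Dict[str, int]:
    """Size in bytes of the input and of its gzip, bz2, and lzma encodings."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data)),
        "bz2": len(bz2.compress(data)),
        "lzma": len(lzma.compress(data)),
    }


if __name__ == "__main__":
    fastq = make_fastq()
    full = compressed_sizes(fastq.encode("ascii"))
    dna_only = compressed_sizes("\n".join(sequence_lines(fastq)).encode("ascii"))
    for name in full:
        print(f"{name:>5}: full file {full[name]:>6} B, sequences only {dna_only[name]:>6} B")
```

The sizes printed for random synthetic reads say nothing about real sequencing data, where repeated regions and structured quality strings change the picture; the sketch only shows the mechanics of the comparison.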

Summary

Introduction

Next-Generation Sequencing technologies (NGS for short) have enabled DNA sequencing at surprisingly high speed and low cost. Because of the new hardware and software tools recently developed, the data formats that describe sequencing results have changed frequently. One of the most important results has been the development of a well-defined and automated workflow for the analysis of genetic data [7].

Workflow
File Formats
Compression of Next-Generation Sequencing Data
Data Compression Tools
DNA Compression
Findings
Conclusions
