Abstract

Next Generation Sequencing (NGS) platforms generate nucleotide sequences together with header data and quality information, and may produce gigabyte-scale datasets. Storing these datasets is a major challenge of NGS technology. Nucleotide sequences, supporting information and quality scores are stored in FASTQ format. In this paper, we consider the quality scores and propose an algorithm for their lossless compression. We search for a statistical model that gives the lowest entropy on quality score data, and combine this model with arithmetic coding to compress the quality scores as compactly as possible. We compare its performance to general-purpose text compression utilities such as bzip2, gzip and ppmd, and to existing compression algorithms for quality scores. We show that our compression algorithm outperforms both groups.
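
As a rough illustration of what "lowest entropy" means here, the sketch below estimates the empirical entropy of a quality string under two simple models: one that treats each score independently, and one that conditions on the k preceding scores. This is not the model proposed in the paper; the function names and the toy input are assumptions made for the example. An ideal arithmetic coder driven by a model would need close to the reported number of bits per symbol, so a lower value suggests a better model for that data.

```python
import math
from collections import Counter, defaultdict


def order0_entropy(quals: str) -> float:
    """Empirical entropy in bits per symbol, ignoring context."""
    counts = Counter(quals)
    n = len(quals)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def conditional_entropy(quals: str, k: int) -> float:
    """Empirical entropy in bits per symbol given the k preceding symbols."""
    # Count how often each symbol follows each length-k context.
    ctx_counts = defaultdict(Counter)
    for i in range(k, len(quals)):
        ctx_counts[quals[i - k:i]][quals[i]] += 1
    total = len(quals) - k
    h = 0.0
    for counter in ctx_counts.values():
        ctx_total = sum(counter.values())
        for c in counter.values():
            # Joint frequency times log of the conditional frequency.
            h -= (c / total) * math.log2(c / ctx_total)
    return h


if __name__ == "__main__":
    # Hypothetical quality string (ASCII-encoded scores as in FASTQ).
    quals = "IIIIHHHHGGGGIIIIHHHHGGGG"
    print("order-0 bits/symbol:", order0_entropy(quals))
    print("order-2 bits/symbol:", conditional_entropy(quals, 2))
```

These estimates are measured on the same data used to build the counts, so they are optimistic for high-order contexts on short inputs; a real adaptive coder also pays a cost for learning the statistics.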
