Abstract

Because quality scores in sequencing data are random and noisy, they already account for about 70% of compressed storage and have reached their lossless compression limit. Lossy compression of quality scores that preserves the performance of the subsequent variant calling procedure is therefore an attractive option, and a major challenge, in big genomic data analysis. The current state-of-the-art locality-based lossy compressor, PBlock, assumes that all quality score lines exhibit a single locality, and it applies one identical, static, manually chosen locality criterion to smooth every quality score line. This assumption usually does not hold in practice, so the criterion is inevitably sub-optimal for some quality score lines, which degrades the performance of both lossy compression and variant calling. We therefore propose ALL-CQS, an enhanced version of PBlock built on the more reasonable assumption that different quality score lines exhibit different locality. ALL-CQS builds on PBlock's lossy mechanism and automatically applies adaptive locality criteria to smooth different quality score lines. Experimental results show that ALL-CQS achieves the best variant calling performance, very close to that of lossless compression, and, in terms of compression ratio, outperforms all the other state-of-the-art lossy compressors and achieves up to 145% improvement over the original lossless compressors.

