Better quality score compression through sequence-based quality smoothing

Yoshihiro Shibuya, Matteo Comin

doi:10.1186/s12859-019-2883-5

Abstract

Motivation: Current NGS techniques are becoming exponentially cheaper. As a result, genomic data are growing exponentially while storage capacity is not, which makes compression necessary. Most of the entropy of NGS data lies in the quality values associated with each read. Those values are often more diversified than necessary. For this reason, many tools, such as Quartz or GeneCodeq, try to change (smooth) quality scores in order to improve compressibility without altering the important information they carry for downstream analysis such as SNP calling.

Results: We use the FM-Index, a type of compressed suffix array, to reduce the storage requirements of a dictionary of k-mers, together with an effective smoothing algorithm that maintains high precision for SNP calling pipelines while reducing the entropy of the quality scores. We present YALFF (Yet Another Lossy Fastq Filter), a tool for quality score compression by smoothing, leading to improved compressibility of FASTQ files. The succinct k-mer dictionary allows YALFF to run on consumer computers with only 5.7 GB of available free RAM. YALFF's smoothing algorithm can improve genotyping accuracy while using fewer resources.

Availability: https://github.com/yhhshb/yalff

Highlights

Modern sequencing technologies produce large amounts of data compared to older machines
We present YALFF (Yet Another Lossy Fastq Filter), a tool for quality score compression by smoothing, leading to improved compressibility of FASTQ files
In this paper we present YALFF (Yet Another Lossy Fastq Filter), a reference-based quality score compressor based on k-mers and the Burrows-Wheeler Transform (BWT) [21] that is capable of improving compression while introducing low distortion into the processed data
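
The idea of dictionary-based quality smoothing can be illustrated with a minimal Python sketch. It is not YALFF's actual algorithm: the function names, the toy dictionary, the exact-match lookup in a plain Python set (a real tool would query a compressed index such as an FM-Index) and the replacement quality character `I` are all assumptions made for illustration. The sketch overwrites the quality of every base covered by a dictionary k-mer and compares the empirical entropy of the quality string before and after.

```python
from collections import Counter
from math import log2


def shannon_entropy(quality: str) -> float:
    """Empirical zero-order entropy, in bits per symbol, of a quality string."""
    counts = Counter(quality)
    total = len(quality)
    return -sum(c / total * log2(c / total) for c in counts.values())


def smooth_qualities(seq: str, qual: str, kmers: set, k: int, high: str = "I") -> str:
    """Set the quality of every base covered by a dictionary k-mer to `high`.

    Hypothetical simplification: only exact k-mer matches are considered.
    """
    smoothed = list(qual)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in kmers:
            for j in range(i, i + k):
                smoothed[j] = high
    return "".join(smoothed)


# Toy inputs, invented for this example.
dictionary = {"ACGT", "CGTA"}
seq = "ACGTAGG"
qual = "F#AB:DE"

smoothed = smooth_qualities(seq, qual, dictionary, k=4)
print(smoothed)  # IIIIIDE
print(shannon_entropy(qual), shannon_entropy(smoothed))
```

In this toy case the first five bases are covered by known k-mers, so their qualities collapse to a single symbol and the entropy of the quality string drops, which is what makes the smoothed string easier to compress.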

Summary

Introduction

Modern sequencing technologies produce large amounts of data compared to older machines. A single run can produce dozens of gigabytes, and in the near future the amount of data is expected to grow to the order of terabytes [1]. This raises the serious question of how to efficiently store and transmit these huge data sets, especially in anticipation of the widespread adoption of personalized medicine and machine learning tasks. Sequencers typically store their output in the well-known FASTQ format, a text file containing, for each read, an identifier, the nucleotide sequence, and a quality string.
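
As a rough illustration of the record layout just described, the following Python sketch reads FASTQ records and decodes a quality string into integer scores. It assumes the common four-line layout (header starting with `@`, sequence, a `+` separator line, quality string) and the Phred+33 encoding; the names `FastqRecord`, `read_fastq` and `phred_scores` are hypothetical and do not come from the paper.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream (assumed layout)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> list:
    """Decode a quality string into integer scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in quality]


# Toy input, invented for this example.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records[0].identifier, phred_scores(records[0].quality))
# read1 [40, 40, 20, 0]
```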



Journal: BMC Bioinformatics
Publication Date: Nov 1, 2019
Citations: 6
License type: open-access


Similar Papers

ACO: lossless quality score compression based on adaptive coding order
Yi Niu ... Guangming Shi
BMC Bioinformatics | VOL. 23
07 Jun 2022

A Two-Level Scheme for Quality Score Compression.
Jan Voges ... Jörn Ostermann
Journal of computational biology : a journal of computational molecular cell biology | VOL. 25
30 Jul 2018

GeneCodeq: quality score compression and improved genotyping using a Bayesian framework.
Daniel L Greenfield ... Alban Rrustemi
Bioinformatics (Oxford, England) | VOL. 32
26 Jun 2016

Denoising of Quality Scores for Boosted Inference and Reduced Storage.
Idoia Ochoa ... Euan Ashley
Proceedings. Data Compression Conference | VOL. 2016
01 Mar 2016
