Denoising of Aligned Genomic Data

Irena Fischer-Hwang, Mikel Hernaez, Tsachy Weissman, Idoia Ochoa

doi:10.1038/s41598-019-51418-z

Open Access

https://doi.org/10.1038/s41598-019-51418-z

Abstract

Noise in genomic sequencing data is known to affect various stages of genomic data analysis pipelines. Variant identification is an important step of many of these pipelines, and is increasingly being used in clinical settings to aid medical practices. We propose a denoising method, dubbed SAMDUDE, which operates on aligned genomic data in order to improve variant calling performance. Denoising human data with SAMDUDE resulted in improved variant identification in both individual chromosome and whole genome sequencing (WGS) data sets. In the WGS data set, denoising led to identification of almost 2,000 additional true variants, and elimination of over 1,500 erroneously identified variants. In contrast, we found that denoising with other state-of-the-art denoisers significantly worsened variant calling performance. SAMDUDE is written in Python and is freely available at https://github.com/ihwang/SAMDUDE.

Highlights

Noise in genomic sequencing data is known to affect various stages of genomic data analysis pipelines.
We propose a novel denoising method, SAMDUDE, which takes advantage of alignment information contained in the SAM file in order to both denoise reads and update quality scores.
We show that simultaneously denoising reads and updating quality scores either maintains or improves variant calling with respect to the original SAM file, while denoising schemes that change only the reads result in degraded variant calling performance.
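To make the idea of changing a base and its quality score together more concrete, the following is a minimal toy sketch in Python. It is not the SAMDUDE algorithm, whose details are not given in this summary. It assumes a simple majority-vote rule over one alignment column, and the function name `denoise_column` and the thresholds `min_fraction`, `max_quality` and `new_quality` are all hypothetical choices made for illustration.

```python
from collections import Counter


def denoise_column(column, min_fraction=0.9, max_quality=20, new_quality=2):
    """Toy denoising rule for a single alignment column (NOT SAMDUDE itself).

    column: list of (base, phred_quality) pairs, one per read covering
    the position. All thresholds are hypothetical illustration values.
    """
    if not column:
        return []

    # Find the most frequent base among the reads covering this position.
    counts = Counter(base for base, _ in column)
    consensus, votes = counts.most_common(1)[0]

    # Without a strong consensus, leave the column untouched.
    if votes / len(column) < min_fraction:
        return list(column)

    denoised = []
    for base, quality in column:
        if base != consensus and quality <= max_quality:
            # Change the base AND its quality score together, rather than
            # keeping the original score for a base that was rewritten.
            denoised.append((consensus, new_quality))
        else:
            denoised.append((base, quality))
    return denoised


# Example: nine confident 'A' calls and one low-quality 'C' call.
example = [("A", 38)] * 9 + [("C", 5)]
result = denoise_column(example)
```

The point of the sketch is only the coupling of the two updates: a denoiser that rewrote the base but kept its original quality score would correspond to the reads-only schemes that the highlights contrast with SAMDUDE.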

Summary

Introduction

Noise in genomic sequencing data is known to affect various stages of genomic data analysis pipelines. Existing denoisers attempt to rectify sequencing errors by changing only individual bases in reads, while retaining the original quality scores. They are typically tested on simulated and real data sets in FASTQ format, and have been shown to perform well on some of the early stages of genomic sequencing pipelines, such as correcting base calling errors in simulated data sets, increasing both breadth and depth of read coverage during alignment[6], or improving de novo assembly of real data sets[7]. We evaluate files that have been denoised using other state-of-the-art denoisers that operate solely on reads in FASTQ files by comparing the resulting variant calls. This variant calling comparison methodology has already been used to analyze the effect of lossy compression of quality scores beyond the early steps in a genomic sequencing pipeline[9]. We show that simultaneously denoising reads and updating quality scores either maintains or improves variant calling with respect to the original SAM file, while denoising schemes that change only the reads result in degraded variant calling performance.
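A variant calling comparison of this kind can be sketched as follows. This is a minimal illustration rather than the evaluation pipeline used in the paper: it assumes each call set has already been reduced to a Python set of `(chromosome, position, ref, alt)` tuples, that a trusted truth set is available, and that exact tuple matching is an acceptable notion of agreement. The function name `compare_calls` and the choice of precision, recall and F-score as summary metrics are assumptions made for this example.

```python
def compare_calls(called, truth):
    """Compare a set of called variants against a truth set.

    called, truth: sets of (chromosome, position, ref, alt) tuples.
    Exact tuple equality is assumed to define a match (a simplification).
    """
    true_positives = len(called & truth)    # true variants that were called
    false_positives = len(called - truth)   # erroneously identified variants
    false_negatives = len(truth - called)   # true variants that were missed

    called_total = true_positives + false_positives
    truth_total = true_positives + false_negatives
    precision = true_positives / called_total if called_total else 0.0
    recall = true_positives / truth_total if truth_total else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0

    return {
        "tp": true_positives,
        "fp": false_positives,
        "fn": false_negatives,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical toy call sets, invented purely for illustration.
truth = {("chr20", 100, "A", "G"), ("chr20", 250, "C", "T"),
         ("chr20", 400, "G", "A"), ("chr20", 900, "T", "C")}
original = {("chr20", 100, "A", "G"), ("chr20", 250, "C", "T"),
            ("chr20", 555, "A", "C")}
denoised = {("chr20", 100, "A", "G"), ("chr20", 250, "C", "T"),
            ("chr20", 400, "G", "A")}

before = compare_calls(original, truth)
after = compare_calls(denoised, truth)
```

In this toy example the denoised call set gains one true variant and loses one erroneous call relative to the original, which mirrors the two quantities reported for the WGS data set in the abstract (additional true variants identified and erroneous variants eliminated).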

Objectives

Methods

Results

Conclusion