Masking as an effective quality control method for next-generation sequencing data analysis.

Sajung Yun, Sijung Yun

doi:10.1186/s12859-014-0382-2

Abstract

Background: Next-generation sequencing produces base calls with low quality scores that can affect the accuracy of identifying simple nucleotide variation calls, including single nucleotide polymorphisms and small insertions and deletions. Here we compare the effectiveness of two data preprocessing methods, masking and trimming, and the accuracy of simple nucleotide variation calls on whole-genome sequence data from Caenorhabditis elegans. Masking substitutes low-quality base calls with ‘N’s (undetermined bases), whereas trimming removes low-quality bases, which results in shorter read lengths.

Results: We demonstrate that masking is more effective than trimming in reducing the false-positive rate in single nucleotide polymorphism (SNP) calling. However, neither preprocessing method changed the false-negative rate in SNP calling with statistical significance compared with data analysis without preprocessing. The false-positive and false-negative rates for small insertions and deletions did not differ between masking and trimming.

Conclusions: We recommend masking over trimming as a more effective preprocessing method for next-generation sequencing data analysis, since masking reduces the false-positive rate in SNP calling without sacrificing the false-negative rate, although trimming is currently more commonly used in the field. The Perl script for masking is available at http://code.google.com/p/subn/. The sequencing data used in the study were deposited in the Sequence Read Archive (SRX450968 and SRX451773).

Highlights

Next-generation sequencing produces base calls with low quality scores that can affect the accuracy of identifying simple nucleotide variation calls, including single nucleotide polymorphisms and small insertions and deletions
Trimming deletes base calls in a next-generation sequencing (NGS) read such that there remains a contiguous string of bases with quality scores above a user-defined cutoff threshold or until the average quality of the remaining reads falls below the threshold value
Trimming and masking on alignments: trimming decreased the total number of reads by 7.1%, from 27,528,260 reads to 25,570,162 reads (Table 1)

Summary

Introduction

Next-generation sequencing (NGS) produces base calls with low quality scores that can affect the accuracy of identifying simple nucleotide variation calls, including single nucleotide polymorphisms and small insertions and deletions. We compare the effectiveness of two data preprocessing methods, masking and trimming, and the accuracy of simple nucleotide variation calls on whole-genome sequence data from Caenorhabditis elegans. Bioinformatic quality control methods have been introduced into NGS analysis pipelines to increase the accuracy of simple nucleotide variation (SNV) calls, which include single nucleotide polymorphisms (SNPs) and insertions and deletions (indels). Trimming is a commonly used bioinformatic quality control method for base calls with low quality. It deletes base calls in an NGS read such that there remains a contiguous string of bases with quality scores above a user-defined cutoff threshold or until the average quality of the remaining reads falls below the threshold value. Trimming was not as effective as masking in reducing the false-positive rate.
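The difference between the two preprocessing methods can be illustrated with a minimal Python sketch. It is not the authors' Perl script, and it is not the algorithm of any particular trimming tool. It assumes Phred+33 quality encoding and a cutoff of 20, both chosen here for illustration only; the function names `mask_read` and `trim_read` are hypothetical. The trimming function implements only the first criterion described above (keeping a contiguous string of bases at or above the cutoff, here the longest such run), not the average-quality criterion.

```python
def mask_read(seq, qual, threshold=20, offset=33):
    """Masking: replace each base whose quality is below `threshold` with 'N'.

    The read length is unchanged. `threshold` and `offset` (Phred+33)
    are assumptions made for this sketch.
    """
    return "".join(
        base if ord(q) - offset >= threshold else "N"
        for base, q in zip(seq, qual)
    )


def trim_read(seq, qual, threshold=20, offset=33):
    """Trimming (simplified): keep the longest contiguous run of bases
    whose quality is at or above `threshold`. The read usually gets shorter.
    """
    scores = [ord(q) - offset for q in qual]
    best_start = best_len = 0
    run_start = 0
    # Iterate one position past the end so the final run is also closed.
    for i in range(len(scores) + 1):
        if i == len(scores) or scores[i] < threshold:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = i + 1
    end = best_start + best_len
    return seq[best_start:end], qual[best_start:end]


# Toy read: 'I' encodes quality 40 and '#' encodes quality 2 under Phred+33.
seq = "ACGTACGT"
qual = "IIII#III"
masked = mask_read(seq, qual)
trimmed_seq, trimmed_qual = trim_read(seq, qual)
print(masked)       # ACGTNCGT  (same length, low-quality base masked)
print(trimmed_seq)  # ACGT      (shorter read, bases after the cut are lost)
```

In this toy read, masking keeps the three high-quality bases after the low-quality position available for alignment, whereas this simplified trimming discards them along with the low-quality base.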

Methods

Results

Conclusion