Abstract
Motivation: Several software tools specialize in the alignment of short next-generation sequencing reads to a reference sequence. Some of these tools report a mapping quality score for each alignment; in principle, this score tells researchers the likelihood that the alignment is correct. However, the reported mapping quality often correlates weakly with actual accuracy, and the qualities of many mappings are underestimated, which encourages researchers to discard correct mappings. Further, these low-quality mappings tend to correlate with variations in the genome (both single nucleotide and structural), and such mappings are important in accurately identifying genomic variants.
Approach: We develop a machine learning tool, LoQuM (LOgistic regression tool for calibrating the Quality of short read mappings), to assign reliable mapping quality scores to mappings of Illumina reads returned by any alignment tool. LoQuM uses statistics on the read (base quality scores reported by the sequencer) and on the alignment (number of matches, mismatches and deletions; the mapping quality score returned by the alignment tool, if available; and the number of mappings) as features for classification. It uses simulated reads to learn a logistic regression model that relates these features to actual mapping quality.
Results: We test the predictions of LoQuM on an independent dataset generated by the ART short read simulation software and observe that LoQuM can ‘resurrect’ many mappings that are assigned zero quality scores by the alignment tools and are therefore likely to be discarded by researchers. We also observe that the recalibration of mapping quality scores greatly enhances the precision of called single nucleotide polymorphisms.
Availability: LoQuM is available as open source at http://compbio.case.edu/loqum/.
Contact: matthew.ruffalo@case.edu
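The approach described in the abstract can be illustrated with a minimal sketch. It is not LoQuM's implementation: the three features (scaled mean base quality, mismatch count, number of mappings), the toy training rows, the quality cap of 60, and the helper names `train`, `predict` and `phred` are all assumptions made for illustration, and plain batch gradient descent stands in for whatever fitting procedure the tool actually uses. The Phred scaling, Q = -10 log10(P(mapping is wrong)), is the usual convention for mapping quality scores.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, features):
    # Estimated probability that a mapping with these features is correct.
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train(rows, labels, epochs=3000, lr=0.1):
    # Batch gradient descent on the mean logistic (log) loss.
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict(weights, bias, x) - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


def phred(p_correct, cap=60.0):
    # Phred-scaled quality: -10 * log10(P(wrong)), capped (cap is an assumption).
    p_wrong = max(1.0 - p_correct, 10 ** (-cap / 10))
    return -10.0 * math.log10(p_wrong)


# Hypothetical simulated mappings. Each row is
# [mean base quality / 40, number of mismatches, number of mappings];
# the label is 1 if the simulated read was mapped to its true origin.
rows = [
    [0.95, 0, 1],
    [0.90, 1, 1],
    [0.85, 0, 2],
    [0.60, 4, 5],
    [0.55, 3, 8],
    [0.70, 5, 3],
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train(rows, labels)

# Recalibrated quality for a new, made-up mapping.
new_mapping = [0.92, 0, 2]
p = predict(weights, bias, new_mapping)
print(f"P(correct) = {p:.3f}, recalibrated quality = {phred(p):.1f}")
```

Because simulated reads come with a known origin, the labels are available for free, which is what makes a supervised model of this kind trainable.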
Highlights
Next-generation genome sequencing (NGS) has quickly become very popular in life sciences because of its utility in efficiently generating high-quality sequence data (Meyerson et al., 2010).
Given a threshold on the mapping quality score, a false positive (FP) is a read that is incorrectly mapped but whose score exceeds the threshold, and a false negative (FN) is a read that is correctly mapped but is discarded because its score is less than the threshold.
We evaluate the performance of LoQuM on multiple alignment tools and compare the classifier output with the raw mapping quality.
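The FP and FN definitions above can be made concrete with a short sketch. The function name `confusion_counts`, the toy scores, and the choice to keep a mapping when its score is greater than or equal to the threshold are assumptions for illustration; the source does not say how a score exactly equal to the threshold is treated.

```python
def confusion_counts(mappings, threshold):
    """Count outcomes for (score, is_correct) pairs at a quality threshold.

    A mapping is kept when score >= threshold (tie handling is an assumption).
    """
    tp = fp = fn = tn = 0
    for score, is_correct in mappings:
        kept = score >= threshold
        if kept and is_correct:
            tp += 1  # correct mapping that is kept
        elif kept and not is_correct:
            fp += 1  # incorrect mapping that passes the threshold
        elif not kept and is_correct:
            fn += 1  # correct mapping that is discarded
        else:
            tn += 1  # incorrect mapping that is discarded
    return tp, fp, fn, tn


# Made-up mappings: (mapping quality score, mapped to the true location?)
toy_mappings = [
    (60, True),
    (37, True),
    (0, True),   # correct mapping that was assigned zero quality
    (0, True),
    (25, False),
    (3, False),
]

tp, fp, fn, tn = confusion_counts(toy_mappings, threshold=10)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(tp, fp, fn, tn, round(precision, 3), round(recall, 3))
```

In this toy data the two correct mappings with zero quality are counted as false negatives at any positive threshold, which is the kind of loss a recalibrated score is meant to reduce.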
Summary
Next-generation genome sequencing (NGS) has quickly become very popular in life sciences because of its utility in efficiently generating high-quality sequence data (Meyerson et al., 2010). Many computational methods are already available for analyzing genetic variants using NGS data. These variants include single-nucleotide polymorphisms (SNPs) and structural variants such as copy number changes, insertions, deletions, tandem duplications, inversions and translocations. The first step in the detection and analysis of genetic variants is usually the alignment of short NGS reads from an individual’s (donor) genome to a reference genome. This task poses significant computational challenges because of the large number of reads (sometimes tens of millions) and the size of many reference genomes (generally on the order of billions of base pairs). Many software tools have been developed to address these challenges and to align short reads to the reference genome efficiently and accurately (Ruffalo et al., 2011). These tools include BWA (Li and Durbin, 2009), SOAP (Li et al., 2009c), Novoalign (Novocraft, 2010) and mr(s)FAST (Alkan et al., 2009; Hach et al., 2010).