Abstract
BackgroundThe processing and analysis of the large scale data generated by next-generation sequencing (NGS) experiments is challenging and is a burgeoning area of new methods development. Several new bioinformatics tools have been developed for calling sequence variants from NGS data. Here, we validate the variant calling of these tools and compare their relative accuracy to determine which data processing pipeline is optimal.ResultsWe developed a unified pipeline for processing NGS data that encompasses four modules: mapping, filtering, realignment and recalibration, and variant calling. We processed 130 subjects from an ongoing whole exome sequencing study through this pipeline. To evaluate the accuracy of each module, we conducted a series of comparisons between the single nucleotide variant (SNV) calls from the NGS data and either gold-standard Sanger sequencing on a total of 700 variants or array genotyping data on a total of 9,935 single-nucleotide polymorphisms. A head to head comparison showed that Genome Analysis Toolkit (GATK) provided more accurate calls than SAMtools (positive predictive value of 92.55% vs. 80.35%, respectively). Realignment of mapped reads and recalibration of base quality scores before SNV calling proved to be crucial to accurate variant calling. GATK HaplotypeCaller algorithm for variant calling outperformed the UnifiedGenotype algorithm. We also showed a relationship between mapping quality, read depth and allele balance, and SNV call accuracy. However, if best practices are used in data processing, then additional filtering based on these metrics provides little gains and accuracies of >99% are achievable.ConclusionsOur findings will help to determine the best approach for processing NGS data to confidently call variants for downstream analyses. To enable others to implement and replicate our results, all of our codes are freely available at http://metamoodics.org/wes.
Highlights
The processing and analysis of the large scale data generated by next-generation sequencing (NGS) experiments is challenging and is a burgeoning area of new methods development
Pipeline development We developed a modular pipeline for processing NGS as shown in Figure 1 and described in Additional file 1 and in more detail at our Wiki site
Variant quality score recalibration versus hard filter Moving forward with Genome Analysis Toolkit (GATK), we examined the accuracy of calls when using hard filtering with recommended thresholds from GATK; a full description is provided in Additional file 1 versus using GATK's Variant Quality Score Recalibration (VQSR), which builds a Gaussian mixture model by looking at the annotation values over a high-quality subset of the input call set and uses this model to evaluate all input variants
Summary
The processing and analysis of the large scale data generated by next-generation sequencing (NGS) experiments is challenging and is a burgeoning area of new methods development. Several new bioinformatics tools have been developed for calling sequence variants from NGS data. We validate the variant calling of these tools and compare their relative accuracy to determine which data processing pipeline is optimal. Given accurately mapped and calibrated reads, identifying simple SNPs, let alone more complex variation such as multiple base pair substitutions, insertions, deletions, inversions, and copy number variation, requires complex statistical models and sophisticated bioinformatics tools to implement these models on large amounts of data [16,18]. Many questions remain about how well these different tools work in identifying and accurately calling sequence variation and what are the best strategies for optimizing their use. Several recent studies have begun to evaluate and compare the performance of these tools [23,24,25]
Published Version (Free)
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have