Abstract
Any single nucleotide variant detection study could benefit from a fast and cheap method of measuring the quality of a variant call list. It is advantageous to see how the call list quality is affected by different variant filtering thresholds and other adjustments to the study parameters. Here we explore the possibility of estimating the proportion of true positives in a single nucleotide variant call list for human data. Using whole-exome and whole-genome gold standard data sets for training, we focus on building a generic model that relies only on information available from any variant caller. We assess and compare the performance of different candidate models based on their practical accuracy. We find that the generic model delivers reasonable accuracy in most cases. Further, we conclude that its performance could be improved substantially by leveraging the variant quality metrics that are specific to each variant calling tool.
Highlights
Identifying single nucleotide variants (SNV) is a major application of next-generation sequencing
A variant caller usually produces a number of variant-level statistics that are meant to be used for downstream variant filtering to adjust the call list quality further
In terms of statistical significance, we obtain strong evidence in favor of retaining Het/Hom: if we compare the two models based on their AIC values (7219.47 with Het/Hom and 7230.74 without), the model that contains Het/Hom has an Akaike weight of over 99%
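The AIC comparison above can be checked with a short calculation. The sketch below uses the standard definition of Akaike weights, where each model's relative likelihood is exp(-Δ/2) and Δ is its AIC minus the smallest AIC in the set. The function name `akaike_weights` is a hypothetical helper introduced for illustration; only the two AIC values come from the text.

```python
import math

def akaike_weights(aic_values):
    """Return the Akaike weight of each model, given a list of AIC values.

    Hypothetical helper for illustration: weight_i is proportional to
    exp(-0.5 * (AIC_i - AIC_min)), normalised so the weights sum to 1.
    """
    best = min(aic_values)
    relative_likelihoods = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(relative_likelihoods)
    return [r / total for r in relative_likelihoods]

# AIC values quoted in the text: with Het/Hom, then without Het/Hom.
weights = akaike_weights([7219.47, 7230.74])
print(f"with Het/Hom:    {weights[0]:.4f}")
print(f"without Het/Hom: {weights[1]:.4f}")
```

With an AIC difference of 11.27, the weight of the model that includes Het/Hom comes out above 0.99, which agrees with the figure quoted above.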
Summary
Identifying single nucleotide variants (SNVs) is a major application of next-generation sequencing. SNV calling is a multistep process that does not end once a variant caller has been invoked. Every variant caller allows the user to specify at least one parameter to adjust the sensitivity of the call list by imposing a threshold on the variant quality score, denoted QUAL in the Variant Call Format (VCF). While it is possible to come up with reasonable filtering thresholds, the ways of observing how different filtering settings affect the quality of the call list (if at all) are fairly limited. Verification approaches such as Sanger sequencing or SNV arrays are expensive.
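To make the QUAL thresholding described above concrete, here is a minimal sketch in plain Python that counts how many calls survive different thresholds. It assumes only the standard tab-separated VCF layout, in which QUAL is the sixth column and a missing value is written as a dot. The function names and the toy records are made up for illustration and do not come from the study.

```python
def parse_qual(vcf_line):
    """Return the QUAL value of a VCF data line, or None if it is missing ('.')."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]  # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
    return None if qual == "." else float(qual)

def count_passing(vcf_lines, threshold):
    """Count data lines whose QUAL is present and at least `threshold`."""
    passing = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information, and the header
        qual = parse_qual(line)
        if qual is not None and qual >= threshold:
            passing += 1
    return passing

# Toy records (illustrative only, not real data).
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t12.5\t.\t.",
    "1\t2000\t.\tC\tT\t48.0\t.\t.",
    "1\t3000\t.\tG\tA\t95.3\t.\t.",
    "2\t1500\t.\tT\tC\t.\t.\t.",
]

counts = {t: count_passing(example_vcf, t) for t in (0, 20, 50)}
for threshold, n in counts.items():
    print(f"QUAL >= {threshold}: {n} calls retained")
```

A count like this shows how the size of the call list responds to the threshold, but it says nothing about how many of the retained calls are true variants. That gap is the motivation for estimating the proportion of true positives directly.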