Abstract

Background: Platform-specific error profiles necessitate confirmatory studies where predictions made on data generated using one technology are additionally verified by processing the same samples on an orthogonal technology. However, verifying all predictions can be costly and redundant, and testing a subset of findings is often used to estimate the true error profile.

Results: To determine how to create subsets of predictions for validation that maximize accuracy of global error profile inference, we developed Valection, a software program that implements multiple strategies for the selection of verification candidates. We evaluated these selection strategies on one simulated and two experimental datasets.

Conclusions: Valection is implemented in multiple programming languages, available at: http://labs.oicr.on.ca/boutros-lab/software/valection.

Highlights

  • Platform-specific error profiles necessitate confirmatory studies where predictions made on data generated using one technology are verified by processing the same samples on an orthogonal technology

  • In the ‘weighted’ mode, precision scores are modified so that unique calls carry more weight than calls predicted by multiple callers. This places more emphasis on true positive calls that are unique to a single submission (i.e. single-nucleotide variants (SNVs) that are more difficult to detect) over those that are found across multiple submissions (see the sketch after this list)

  • This is important to consider, given that one key goal of SNV calling is to maximize the number of true mutations detected
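
To make the weighting concrete, the following is a minimal Python sketch of a weighted precision calculation, assuming each call's weight is the reciprocal of the number of submissions that reported it. The exact weighting scheme used by Valection may differ, and the function and variable names here are illustrative.

```python
from collections import Counter

def weighted_precision(submission_calls, all_submissions, true_calls):
    """Precision where each call is weighted by 1 / (number of
    submissions that reported it), so unique calls count more.

    Assumed weighting for illustration; not necessarily the exact
    scheme implemented in Valection.

    submission_calls: set of calls from one caller
    all_submissions:  list of call sets, one per caller
    true_calls:       set of calls confirmed by the orthogonal platform
    """
    # Count how many submissions reported each call across the study.
    support = Counter(call for sub in all_submissions for call in sub)

    # Rarely-reported calls receive larger weights.
    weights = {call: 1.0 / support[call] for call in submission_calls}
    total = sum(weights.values())
    if total == 0:
        return 0.0
    true_weight = sum(w for call, w in weights.items() if call in true_calls)
    return true_weight / total

if __name__ == "__main__":
    subs = [{"chr1:100A>T", "chr1:200C>G"}, {"chr1:100A>T", "chr2:50G>A"}]
    truth = {"chr1:200C>G", "chr2:50G>A"}
    # Unweighted precision for subs[0] is 0.5; the weighted score is
    # ~0.67 because the unique true call chr1:200C>G carries full weight
    # while the shared call chr1:100A>T is down-weighted to 0.5.
    print(weighted_precision(subs[0], subs, truth))
```

Under this weighting, a true call reported by only one submission contributes twice the weight of one reported by two submissions, rewarding callers that recover hard-to-detect variants.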

Introduction

Platform-specific error profiles necessitate confirmatory studies where predictions made on data generated using one technology are verified by processing the same samples on an orthogonal technology. Error rates can vary significantly between studies because of tissue-specific characteristics, such as DNA quality and sample purity, and because of differences in data processing pipelines and analytical tools. Variations in normal tissue contamination can further confound genomic and transcriptomic analyses [8,9,10]. Taken together, these factors have necessitated the widespread use of studies with orthogonal technologies, both to verify key hits of interest and to quantify the global error rate of specific pipelines. The underlying concept is that if the second technique has an error profile distinct from that of the first, a comparative analysis can readily identify false positives (e.g. inconsistent, low-quality calls) and even begin to elucidate the false negative rate (e.g. from discordant, high-quality calls).
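
As a rough illustration of this triage logic, the following Python sketch classifies a platform-1 call by whether it is reproduced on the orthogonal platform and by its quality score. The threshold value and category labels are hypothetical and not taken from the Valection paper.

```python
def triage_call(confirmed_on_platform2: bool,
                quality_score: float,
                quality_threshold: float = 30.0) -> str:
    """Classify a platform-1 variant call using an orthogonal check.

    Illustrative heuristic only: the threshold and labels are
    hypothetical, not drawn from the Valection paper.
    """
    if confirmed_on_platform2:
        # Concordant across technologies: verified true positive.
        return "verified true positive"
    if quality_score < quality_threshold:
        # Discordant and low quality: likely a platform-1 false positive.
        return "likely false positive"
    # Discordant yet high quality: may instead reflect a false
    # negative on the orthogonal platform.
    return "possible false negative on platform 2"
```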

