Abstract

Several state-of-the-art, easy-to-use tools are available for short-variant detection (e.g. GATK, FREEBAYES) and for structural variant (SV) detection (e.g. LUMPY, MANTA, DELLY). These tools often produce divergent variant calls, especially for INDELs, and reconciling such calls into a single, accurate set is very difficult. Furthermore, although detecting larger structural variants is highly desirable, existing SV detection packages are typically difficult to integrate and highly resource-intensive to run, and they produce call sets that require expert manual review to reduce the false positive rate.

Our algorithm, GRAPHITE (https://github.com/dillonl/graphite), takes as input a collection of variant calls made by one or more short-variant or SV detection tools. Typically, this starting set has high sensitivity (i.e. it is inclusive) but low specificity (i.e. it has a high false discovery rate). We then apply a novel “variant adjudication” procedure that discards false positives while keeping true positive calls. To do this, we construct a graph from the candidate variants (the Variant Graph), in which variant alleles are represented as branches alongside the branches formed by the current, linear genome reference sequence. Using a graph mapping algorithm we developed earlier (GSSW, a graph extension of the Smith-Waterman alignment algorithm), we re-map all reads from each of the samples contributing to the candidate calls. We retain candidate variants that are confirmed by read mappings to the graph branches representing the corresponding variant allele, and discard candidates that are not confirmed by such mappings. This procedure results in a highly specific call set that maintains the high sensitivity of the inclusive starting call set constructed by multiple primary variant calling methods.
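The adjudication idea can be illustrated with a minimal sketch. It is not the GRAPHITE implementation. It assumes a single candidate at a time, reduces the Variant Graph to two linear paths (the reference branch and the variant branch, each with shared flanking sequence), and uses plain Smith-Waterman scoring in place of GSSW. The names `Candidate`, `sw_score`, `branches`, and `adjudicate`, the scoring parameters, the flank length, and the `min_alt_reads` threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate variant (hypothetical structure, VCF-like fields)."""
    pos: int   # 0-based start of the REF allele in the reference
    ref: str   # reference allele
    alt: str   # variant allele


def sw_score(read, target, match=2, mismatch=-2, gap=-3):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(target) + 1)
    best = 0
    for r in read:
        cur = [0]
        for j, t in enumerate(target, 1):
            diag = prev[j - 1] + (match if r == t else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def branches(reference, cand, flank=20):
    """Return (reference path, variant path) sharing the same flanks."""
    left = reference[max(0, cand.pos - flank):cand.pos]
    end = cand.pos + len(cand.ref)
    right = reference[end:end + flank]
    return left + cand.ref + right, left + cand.alt + right


def adjudicate(reference, cand, reads, min_alt_reads=2):
    """Keep a candidate only if enough reads map better to its branch.

    Returns (keep, alt_support, ref_support). Reads that score equally
    on both paths are uninformative and are counted for neither.
    """
    ref_path, alt_path = branches(reference, cand)
    alt_support = ref_support = 0
    for read in reads:
        s_ref = sw_score(read, ref_path)
        s_alt = sw_score(read, alt_path)
        if s_alt > s_ref:
            alt_support += 1
        elif s_ref > s_alt:
            ref_support += 1
    return alt_support >= min_alt_reads, alt_support, ref_support
```

In this sketch a candidate is retained when at least `min_alt_reads` reads score strictly higher on the variant path than on the reference path. Because the same two-path construction applies to substitutions, insertions, and deletions, it reflects, in a much reduced form, why a graph representation lets different variant types be evaluated in one step.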
Because the graph construction and mapping approach works for most types of SVs as well as all short variants, variants of all types can be integrated in a single step. Here we present the application of this method to cross-validating structural variant calls from Pacific Biosciences data by remapping deep Illumina WGS read sets to Variant Graphs constructed from the candidate Pacific Biosciences variants, as part of the Human Genome Structural Variation Consortium (HGSVC) data analysis project. We also present GRAPHITE's application to improving the accuracy of allele frequency measurement in tumor sequencing data, which is essential for the accurate reconstruction of subclonal evolution in longitudinal tumor samples.

Citation Format: Dillon Lee, Yi Qiao, Gabor Marth. A graph remapping framework for in silico adjudication of SNVs, INDELs, and structural variants from genetic sequencing data [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2018; 2018 Apr 14-18; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2018;78(13 Suppl):Abstract nr 3272.