Abstract
In cancer genomics, precise variant annotation is crucial for clinical decisions, drug development, and research. The rapidly growing volume of genomic data offers an opportunity to use data-driven approaches to generate knowledge that supports clinical decisions. Machine learning (ML) and deep learning (DL) in particular are becoming essential, as they are fast, scalable, simple to implement, and generate reproducible results. Methods ranging from simple sequence-alignment-based scoring to more advanced algorithms such as logistic regression, support vector machines, and recurrent neural networks have been employed by multiple variant annotation tools. This study compares the in silico methods available in Ensembl's Variant Effect Predictor (VEP) using a test dataset with COSMIC annotations. The success of ML/DL relies on robust training sets with comprehensive genomic variant features, including effects on transcription/translation, genomic context, annotation resources, in silico pathogenicity predictions, and population allele frequency. The training set, composed of known benign or pathogenic variants, serves as a reference for these algorithms to classify new and unseen variants. Our analysis reveals limited concordance between the prediction algorithms. Despite comparable numbers of true and false positives and negatives, discrepancies persist in how variants are classified. Certain algorithms tend to over-call or under-call deleterious mutations, and some tend to classify random variants in non-cancer genes as deleterious. Challenges include the absence of consensus on informative features, diverse training datasets, and restriction to well-annotated proteins/transcripts. Balancing sensitivity against false positives in detecting cancer drivers is crucial. Integrating individual prediction scores with ML algorithms enhances tool performance but carries a risk of error propagation and limited accuracy.
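The kind of comparison described above, where tools show similar confusion counts yet disagree on individual variants, can be illustrated with a minimal Python sketch. It is not the study's actual pipeline: the tool names, the binary calls, and the truth labels below are hypothetical toy data, and Cohen's kappa is used here only as one plausible concordance measure (the abstract does not state which measure was used).

```python
from itertools import combinations


def confusion(calls, truth):
    """Return (tp, fp, tn, fn) for binary calls against binary truth labels."""
    pairs = list(zip(calls, truth))
    tp = sum(1 for c, t in pairs if c == 1 and t == 1)
    fp = sum(1 for c, t in pairs if c == 1 and t == 0)
    tn = sum(1 for c, t in pairs if c == 0 and t == 0)
    fn = sum(1 for c, t in pairs if c == 0 and t == 1)
    return tp, fp, tn, fn


def cohen_kappa(a, b):
    """Chance-corrected agreement between two lists of binary calls."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:  # both tools give the same constant call
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical data: 1 = deleterious/driver, 0 = benign/passenger.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
calls = {
    "tool_a": [1, 1, 1, 0, 1, 0, 0, 0],
    "tool_b": [1, 1, 0, 1, 0, 1, 0, 0],
}

for name, c in calls.items():
    tp, fp, tn, fn = confusion(c, truth)
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    print(name, (tp, fp, tn, fn), sensitivity, false_positive_rate)

for (n1, c1), (n2, c2) in combinations(calls.items(), 2):
    print(n1, n2, "kappa =", cohen_kappa(c1, c2))
```

In this toy data both hypothetical tools have identical confusion counts against the truth labels, yet their agreement with each other is no better than chance, which is the pattern the abstract describes.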
The study emphasises the need for context-specific variant classification tools, as the impact of many variants is cancer-type specific, and some may drive disease synergistically. Existing tools, designed around a "one variant - one score" approach, struggle to capture complex associations, especially those dependent on changes in the tumour microenvironment. In highlighting areas for improvement, the study also addresses the "black box" problem in decision processes. While limited interpretability might not hinder practical applications, tools should evolve to assess more complex associations guided by biology. Formal consensus, reference training datasets, and standards are deemed essential for developing next-generation tools. The envisioned context-dependent tools aim to streamline feature complexity, thereby mitigating the black box problem and advancing the accuracy and interpretability of cancer variant annotation.

Citation Format: Madhumita, Zbyslaw Sondka, Jon Teague. Evaluating the utility of in silico variant annotation tools for cancer driver detection [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 4884.