A Novel Machine Learning Based in silico Pathogenicity Predictor for Missense Variants in a Hematological Setting

Stephan Hutter, Constance Baer, Wencke Walter, Wolfgang Kern, Claudia Haferlach, Torsten Haferlach

doi:10.1182/blood-2019-128488

Abstract

Background: Interpreting the pathogenic potential of an amino acid-changing single nucleotide variant (SNV) in a disease-related gene can be challenging, especially for rare variants for which little or no information is available in clinical databases. In silico predictors, tools that predict the functional impact of an SNV algorithmically, can be useful in this scenario, and guidelines for variant interpretation recommend their inclusion in the interpretation process. Resources such as the dbNSFP database, which contains pre-calculated prediction scores for dozens of different algorithms, are readily available today. However, individual predictors rarely come to the same conclusion, and even for well-known disease-causing SNVs results can be heterogeneous or contradictory, which complicates their interpretation. Ensemble predictors such as REVEL, MetaLR/SVM or CADD combine information from multiple individual sources. These predictors use machine learning methods and training sets of pre-defined pathogenic and benign SNVs to integrate individual algorithms into a single, easy-to-interpret score. However, current training sets are based on pathogenic germline variants, which might cause these predictors to underperform when testing somatic variants.

Aim: Development of HePPy (Hematological Predictor of Pathogenicity), an ensemble in silico predictor trained on somatic disease-causing variants for use in a hematological setting.

Methods: We followed the approach laid out by REVEL and used 10 in silico predictor scores and 4 phylogenetic conservation scores from the dbNSFP database to train a random forest model. Our training set consisted of 371 unique missense SNVs from 61 hematologically relevant genes that were recurrently identified (in at least 10 patients) during routine diagnostics. All were consistently and unambiguously characterized by hematological experts as either a pathogenic somatic variant (n = 268) or a benign germline variant (n = 103) using a rigorous manual classification process within a data set of 69,879 cases studied between 2005 and 2018. Model accuracy was assessed by 10-fold cross-validation and further evaluated using a test data set consisting of 335 rare missense SNVs from routine diagnostics for which control germline material (buccal swabs, fingernail clippings) from the respective patients was available. Variants originating in the germline were expected to be mainly benign (n = 123), while somatic variants were considered pathogenic (n = 212). We compared the performance of this new tool to REVEL, MetaLR/SVM, CADD and the popular individual predictors SIFT and PolyPhen-2 by generating receiver operating characteristic (ROC) curves and calculating the area under the curve (AUC). Model implementation and analysis were performed using the R libraries "randomForest", "caret" and "pROC".

Results: HePPy scores range from 0 (benign) to 1 (pathogenic), and cross-validation on the training set indicates a high accuracy of 0.968, which is also reflected by the clear separation in the distribution of obtained scores for benign and pathogenic training SNVs (see figure B). Application of the model to the test data set of rare SNVs shows that HePPy (AUC = 0.873) outperforms all other prediction tools in separating germline from somatic variants (see figure A). Surprisingly, both MetaLR (AUC = 0.717) and MetaSVM (AUC = 0.703) performed worse than the individual predictors SIFT (AUC = 0.794) and PolyPhen-2 (AUC = 0.821), while CADD (AUC = 0.831) and REVEL (AUC = 0.850) showed better performance. HePPy scores for somatic test variants were heavily skewed towards very high values (mean = 0.917). Germline variants had significantly lower scores (mean = 0.466), but their distribution was much more uniform than for somatic variants (see figure C). This suggests that a substantial proportion of the rare germline variants may have pathogenic potential. This is in line with the growing awareness of pathogenic germline variants and familial predisposition, and it emphasizes the importance of in silico predictions and other tools to replace the simple "tumor vs. normal" comparison.

Summary: We developed HePPy, a new in silico ensemble predictor trained on 371 well-defined hematopathological somatic missense variants, which outperforms other currently available methods for in silico prediction in a hematological setting.

Figure (panels A to C; not reproduced)

Disclosures: Hutter: MLL Munich Leukemia Laboratory: Employment. Baer: MLL Munich Leukemia Laboratory: Employment. Walter: MLL Munich Leukemia Laboratory: Employment. Kern: MLL Munich Leukemia Laboratory: Employment, Equity Ownership. Haferlach: MLL Munich Leukemia Laboratory: Employment, Equity Ownership. Haferlach: MLL Munich Leukemia Laboratory: Employment, Equity Ownership.
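
The predictor comparison described in the Methods rests on ROC AUC, which the authors computed with the R library "pROC"; their code is not part of the abstract. Purely as an illustration of what that metric measures, the following Python sketch computes AUC through the rank-sum (Mann-Whitney) identity: the AUC equals the probability that a randomly chosen pathogenic variant scores higher than a randomly chosen benign one, with ties counted as one half. The function name `roc_auc` and the toy scores are hypothetical and are not taken from the study.

```python
def roc_auc(scores, labels):
    """Return the ROC AUC for binary labels (1 = pathogenic, 0 = benign).

    Uses the rank-sum identity: AUC = U / (n_pos * n_neg), where U is the
    Mann-Whitney U statistic of the positive class. Tied scores receive
    their average rank, so a tie between a positive and a negative
    contributes 0.5 to the pair count.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")

    pairs = sorted(zip(scores, labels))
    n = len(pairs)

    # Assign 1-based average ranks, handling runs of tied scores.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


if __name__ == "__main__":
    # Hypothetical predictor scores for three somatic (1) and three
    # germline (0) variants; illustrative values only.
    toy_scores = [0.95, 0.90, 0.40, 0.85, 0.30, 0.60]
    toy_labels = [1, 1, 1, 0, 0, 0]
    print(f"AUC = {roc_auc(toy_scores, toy_labels):.3f}")
```

Under this reading, an AUC of 0.5 corresponds to a predictor that ranks somatic and germline variants no better than chance, and 1.0 to perfect separation, which is the scale on which the reported values for the compared tools should be interpreted.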

Journal: Blood
Publication Date: Nov 13, 2019
Citations: 4
