Abstract

A novel approach is developed to address the challenge of annotating with phenotypic effects those exome variants for which relevant empirical data are lacking or minimal. The predictive annotation method is implemented as a stacked ensemble of supervised base-learners, including distributed random forest and gradient boosting machines. Ensemble models were trained and cross-validated on evidence-based categorical variant effect annotations from the ClinVar database, and were applied to 84 million non-synonymous single nucleotide variants (SNVs). The consensus model combined 39 functional mutation impacts, a cross-species conservation score, and a gene indispensability score. The indispensability score, which accounts for differences in variant pathogenicities, including in essential and mutation-tolerant genes, considerably improved the predictions. The consensus combination is consistent with as many input scores as possible while minimizing false predictions. The input scores are ranked by their ability to predict effects. The score rankings and categorical phenotypic variant effect predictions are intended for direct use in clinical and biological applications to prioritize human exome variants and mutations.
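The abstract states that input scores are ranked by their ability to predict effects, without saying how. As a minimal sketch of one common way to rank a score, assuming a binary pathogenic/benign target and using ROC AUC as the ranking criterion (an assumption, not necessarily the authors' method), each score can be evaluated on its own against the labels. The score names and values below are hypothetical toy data, not results from the paper.

```python
# Illustrative only: rank hypothetical input scores by how well each one,
# taken alone, separates pathogenic (1) from benign (0) variants.
# ROC AUC is an assumed criterion; the source does not name its ranking metric.

def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_scores(labels, score_table):
    """Return (name, AUC) pairs sorted from most to least predictive."""
    aucs = {name: roc_auc(labels, values) for name, values in score_table.items()}
    return sorted(aucs.items(), key=lambda item: item[1], reverse=True)


# Toy labels: 1 = pathogenic, 0 = benign (hypothetical, hand-made data).
labels = [1, 1, 1, 0, 0, 0]
score_table = {
    "score_c": [0.5, 0.2, 0.6, 0.7, 0.3, 0.4],
    "score_a": [0.9, 0.8, 0.7, 0.3, 0.2, 0.1],
    "score_b": [0.9, 0.4, 0.7, 0.5, 0.2, 0.1],
}

ranking = rank_scores(labels, score_table)
for name, auc in ranking:
    print(f"{name}: AUC = {auc:.3f}")
```

In this toy data, score_a separates the two classes perfectly and therefore ranks first. A ranking derived from the trained ensemble itself (for example, variable importance) could differ from a per-score ranking like this one.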

Highlights

  • Accurate and exhaustive annotation of human gene variants is important for every application of NGS technology, including development of therapies, selection of an effective individualized therapy, and comparing multiple samples in biological/clinical studies

  • To illustrate the usefulness of combining different scores, we show how variants belonging to different pathogenicity classes are progressively better discriminated when moving from a 1-dimensional to a 2-dimensional space, using one or two selected features, respectively

  • To give quantitative references: in the consensus approach, since the correlations between the target annotations and each individual score are in the 75–82% range (Table 2), and since the consensus VEP (cVEP) predictions have a 98% match with the target, the resulting annotations are in line with the input scores to nearly the same degree as the targets (i.e., 73–80% correlation)
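The last highlight derives a 73–80% range from a 75–82% range and a 98% match. A back-of-the-envelope reading, assuming the 98% match simply scales each correlation multiplicatively (an interpretation of the quoted figures, not a formula stated in the source), can be checked as follows. The function name is hypothetical.

```python
# Rough arithmetic check of the quoted figures.
# Assumption: agreement with the inputs scales multiplicatively with the
# cVEP-to-target match rate; the source does not state this formula.

def propagated_range(low, high, match_rate):
    """Scale a correlation range by the fraction of predictions matching the target."""
    return low * match_rate, high * match_rate


low, high = propagated_range(0.75, 0.82, 0.98)
print(f"{low:.1%} to {high:.1%}")
```

Under this assumption the result is approximately 73.5% to 80.4%, which is roughly consistent with the 73–80% quoted above.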


Summary

Introduction

Accurate and exhaustive annotation of human gene variants is important for every application of NGS technology, including development of therapies, selection of an effective individualized therapy, and comparing multiple samples in biological/clinical studies. Among the many types of annotations that can be assigned to each sequenced NGS variant, of particular interest are those predicting whether a variant has a benign or deleterious phenotypic effect. We briefly overview the state-of-the-art approaches to annotating with phenotypic effects, focusing mostly on an array of predicted pathogenicity scores and on population allele frequencies, which are very popular for estimation of pathogenicity. We list their main limitations and describe an approach proposed in this work with an eye to solving those limitations, followed by a brief overview of the method's results.

Genes 2020, 11, 1076
