Abstract

A human exome sequence contains 15,000-20,000 variants, but many of these variants have unknown clinical impact. In silico predictive classifiers are recognized by the American College of Medical Genetics as a resource for interpreting these "variants of uncertain significance." Many in silico classifiers have been developed, of which PolyPhen-2 is highly successful and widely used. PolyPhen-2 uses a naïve Bayes model to synthesize sequence, structural, and genomic information. I investigated whether predictive performance could be improved by replacing PolyPhen-2's naïve Bayes model with alternative machine learning methods. Classifiers using the PolyPhen-2 feature set were retrained using extreme gradient boosting (XGBoost), random forests, artificial neural networks, and support vector machines. Classifiers were externally validated on "pathogenic" and "benign" ClinVar variants absent from the training datasets. Software is implemented in Python and is freely available at https://github.com/djparente/polyboost and on the Python Package Index (PyPI) under the BSD license. An XGBoost-based classifier, designated PolyBoost (PolyPhen-2 Booster), improves discriminative performance and calibration relative to PolyPhen-2 in external validation on ClinVar. PolyBoost analyzes PolyPhen-2 output and can be incorporated into existing bioinformatics workflows as a post-analysis method to improve interpretation of clinical exome sequences obtained to identify monogenic disease.
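
The two evaluation criteria named above, discriminative performance and calibration, can be illustrated with a minimal sketch. The abstract does not say which metrics were used, so the choice of ROC AUC (for discrimination) and the Brier score (for calibration) is an assumption made here for illustration. The function names and the toy labels and scores are hypothetical and are not taken from PolyBoost or ClinVar.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Discrimination: chance that a random pathogenic variant (label 1)
    scores higher than a random benign variant (label 0); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one variant of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier_score(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Calibration: mean squared gap between predicted probability and the
    0/1 outcome. Lower is better."""
    if len(labels) != len(scores) or not labels:
        raise ValueError("labels and scores must be non-empty and equal length")
    return sum((s - y) ** 2 for y, s in zip(labels, scores)) / len(labels)


# Hypothetical toy data: 1 = "pathogenic", 0 = "benign".
labels = [1, 1, 1, 0, 0, 0]
baseline_scores = [0.9, 0.6, 0.4, 0.5, 0.3, 0.2]  # stand-in for a baseline classifier
rescored = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]         # stand-in for a retrained classifier

for name, scores in [("baseline", baseline_scores), ("rescored", rescored)]:
    print(name, round(roc_auc(labels, scores), 3), round(brier_score(labels, scores), 3))
```

On this toy data the second score set ranks every pathogenic variant above every benign one and has a lower Brier score. It only shows how the two criteria are computed and says nothing about the size of the improvement reported for PolyBoost.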
