Protein structure holds immense potential for pathogenicity prediction, but structure-based predictors remain limited compared with their sequence-based counterparts because of the "structure knowledge gap" between the large number of available protein sequences and the relatively small number of available structures. Leveraging the highly accurate protein structures predicted by AlphaFold2 (AF2), we introduce AFFIPred, an ensemble machine learning classifier that combines sequence and AF2-based structural characteristics to predict missense variant pathogenicity. In assessments on unseen datasets, AFFIPred performed comparably to state-of-the-art predictors such as AlphaMissense. We also showed that using AF2 structures, which are full-length and represent unbound states, yields more precise solvent-accessible surface area (SASA) calculations than using experimental structures. Consistent with their completeness, AF2 structures provide a more comprehensive view of the structural characteristics of missense variation datasets by capturing all variants. AFFIPred thus maintains high accuracy without the limitations of PDB-based classifiers. AFFIPred predictions for over 210 million variations of the human proteome are accessible at https://affipred.timucinlab.com/.