Quantifying the effect of mutations in the BRCA1 gene is useful for understanding their clinical consequences for breast cancer. Machine learning models can be applied to predict the landscape of protein variant effects, which is not always accessible by experiment. In this work, we propose a simple semi-supervised learning method that uses a Gaussian mixture model to predict ∼90% of the unlabeled missense variants of the BRCA1 gene collected from the ClinVar database. High-quality embeddings, extracted with the latest pre-trained transformer-based protein language model, are used as features of the protein sequences. A statistical test shows that the protein embeddings are effective and robust for predicting pathogenicity. Lower-dimensional representations of these features are then fed into the semi-supervised model. On the labeled test data alone, the model achieves an AUC of 79.27% and an accuracy of 71.58%. Using our defined pathogenic probability score, we find that ∼94% of the variants in our unlabeled dataset are well separated into either the benign or the pathogenic class. Our scores show a moderate Spearman rank correlation with the results of established unsupervised variant effect models. Finally, our approach can potentially be developed further for more accurate and biologically reliable predictions of variant effects.
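The semi-supervised step described above can be pictured with the following minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes scikit-learn, uses PCA as the dimensionality reduction, and seeds a two-component Gaussian mixture with the means of the labeled classes before fitting on labeled and unlabeled variants together. The function name `pathogenic_scores`, the label encoding (0 = benign, 1 = pathogenic, -1 = unlabeled), and the default of two reduced dimensions are hypothetical choices made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture


def pathogenic_scores(embeddings, labels, n_dims=2, seed=0):
    """Return a pathogenic probability for every variant.

    embeddings: (n_variants, n_features) array of sequence embeddings.
    labels: 0 = benign, 1 = pathogenic, -1 = unlabeled (assumed encoding).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # Lower-dimensional representation of the embeddings (PCA is an assumption).
    reduced = PCA(n_components=n_dims, random_state=seed).fit_transform(embeddings)

    # Seed component 0 with the benign mean and component 1 with the
    # pathogenic mean, so the labels guide where EM starts.
    means_init = np.stack([reduced[labels == c].mean(axis=0) for c in (0, 1)])

    # Fit the mixture on labeled and unlabeled variants together.
    gmm = GaussianMixture(n_components=2, means_init=means_init, random_state=seed)
    gmm.fit(reduced)

    # Posterior of component 1, read as the pathogenic probability.
    # EM does not guarantee that the component keeps its initial meaning,
    # so this mapping should be checked against the labeled variants.
    return gmm.predict_proba(reduced)[:, 1]
```

Scores near 0 or 1 correspond to variants that the mixture assigns confidently to one class, while scores near 0.5 mark variants that remain ambiguous. Any cutoff used to call a variant well separated is a separate choice that this sketch does not fix.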