Modeling the functional impact of sequence variation is critical for both understanding and developing proteins. This paper presents an Evolutionary Sequence and Gaussian Mixture Model (ESGMM) for predicting variant pathogenicity. The model is trained on 2715 clinical proteins and their homologous sequences, using a Transformer-based protein language model to learn evolutionary patterns of amino acids from multiple sequence alignments (MSAs). To exploit the two-dimensional structure of MSA data, an axial attention mechanism is used during training. The model estimates the probability of every variant relative to the wild type and calculates a score for each variant. To categorize variants as pathogenic or benign, a global–local Gaussian mixture model is then constructed for each variant, and the ESGMM score is produced by combining global and local information. Particle swarm optimization (PSO) is introduced to optimize the local Gaussian mixture model and to further quantify the uncertainty of the classification, which enhances the model's prediction precision. Experimental results demonstrate the superiority of the optimized ESGMM algorithm in predicting the pathogenicity of variants.
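The scoring and classification steps described above can be illustrated with a minimal sketch. It assumes that the variant score is a log-odds ratio between the variant and wild-type residue probabilities and that classification uses a single two-component one-dimensional Gaussian mixture; the abstract does not state either formula. All function names, probabilities, and mixture parameters below are hypothetical, and the global-local construction and the PSO step are not reproduced.

```python
import math


def variant_score(probs, wild_type, variant):
    """Log-odds of the variant residue relative to the wild type at one position.

    Assumption: the abstract only says variants are scored against the wild
    type; the log-odds form is a common choice, not the paper's stated formula.
    """
    return math.log(probs[variant]) - math.log(probs[wild_type])


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def pathogenic_posterior(score, benign, pathogenic, weight_pathogenic=0.5):
    """Posterior probability of the pathogenic component in a two-component mixture.

    `benign` and `pathogenic` are (mean, std) tuples. Values near 0.5 indicate
    an uncertain classification.
    """
    p = weight_pathogenic * gaussian_pdf(score, *pathogenic)
    b = (1.0 - weight_pathogenic) * gaussian_pdf(score, *benign)
    return p / (p + b)


# Hypothetical amino-acid probabilities at one position of a protein,
# standing in for the output of a protein language model.
position_probs = {"A": 0.70, "G": 0.20, "P": 0.01, "S": 0.09}

score_g = variant_score(position_probs, "A", "G")  # mild substitution
score_p = variant_score(position_probs, "A", "P")  # rarely observed substitution

# Hypothetical mixture components as (mean, std); in practice they would be fitted.
benign_component = (-1.0, 1.5)
pathogenic_component = (-5.0, 1.5)

post_g = pathogenic_posterior(score_g, benign_component, pathogenic_component)
post_p = pathogenic_posterior(score_p, benign_component, pathogenic_component)

print(f"A->G score {score_g:.3f}, P(pathogenic) {post_g:.3f}")
print(f"A->P score {score_p:.3f}, P(pathogenic) {post_p:.3f}")
```

Under these illustrative parameters, the substitution that the model considers unlikely receives the lower score and the higher pathogenic posterior. A posterior close to 0.5 would flag a variant whose classification is uncertain.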