Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction.

Yang Qu,Jianping Zheng,Yitian Zhao,Bing Bai,Jianwei Ma,Tong Kong,Zitong Niu,Qiaojiao Ding,Taowa Zhao

doi:10.3390/ijms242216496

Abstract

Machine learning has been increasingly utilized in the field of protein engineering, and research directed at predicting the effects of protein mutations has attracted increasing attention. Among them, so far, the best results have been achieved by related methods based on protein language models, which are trained on a large number of unlabeled protein sequences to capture the generally hidden evolutionary rules in protein sequences, and are therefore able to predict their fitness from protein sequences. Although numerous similar models and methods have been successfully employed in practical protein engineering processes, the majority of the studies have been limited to how to construct more complex language models to capture richer protein sequence feature information and utilize this feature information for unsupervised protein fitness prediction. There remains considerable untapped potential in these developed models, such as whether the prediction performance can be further improved by integrating different models to further improve the accuracy of prediction. Furthermore, how to utilize large-scale models for prediction methods of mutational effects on quantifiable properties of proteins due to the nonlinear relationship between protein fitness and the quantification of specific functionalities has yet to be explored thoroughly. In this study, we propose an ensemble learning approach for predicting mutational effects of proteins integrating protein sequence features extracted from multiple large protein language models, as well as evolutionarily coupled features extracted in homologous sequences, while comparing the differences between linear regression and deep learning models in mapping these features to quantifiable functional changes. We tested our approach on a dataset of 17 protein deep mutation scans and indicated that the integrated approach together with linear regression enables the models to have higher prediction accuracy and generalization. Moreover, we further illustrated the reliability of the integrated approach by exploring the differences in the predictive performance of the models across species and protein sequence lengths, as well as by visualizing clustering of ensemble and non-ensemble features.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Journal: International Journal of Molecular Sciences	Publication Date: Nov 18, 2023
Citations: 2	License type: CC BY 4.0

R Discovery Prime

R Discovery Prime

Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction.

Abstract

Talk to us

Similar Papers

More From: International Journal of Molecular Sciences

Lead the way for us

Similar Papers

Learning to Read and Write in the Language of Proteins
Helen T Hobbs ... Chang C Liu
GEN Biotechnology | VOL. 2
Helen T Hobbs, et. al.Helen T Hobbs ... Chang C Liu
01 Apr 2023
GEN Biotechnology | VOL. 2

Research on protein solubility prediction based on ensemble learning and feature fusion
-
Academic Journal of Computing & Information Science | VOL. 5
--
01 Jan 2021
Academic Journal of Computing & Information Science | VOL. 5

Optimization techniques for sentiment analysis based on LLM (GPT-3)
Tong Zhan ... Huixiang Li
Applied and Computational Engineering | VOL. 67
Tong Zhan, et. al.Tong Zhan ... Huixiang Li
31 May 2024
Applied and Computational Engineering | VOL. 67

Implementation of a Large-Scale Language Model in a Cloud Environment for Human–Robot Interaction
Dae-Young Jung ... Hyung-Bae Jeon
-
Dae-Young Jung, et. al.Dae-Young Jung ... Hyung-Bae Jeon
01 Jan 2013
01 Jan 2013

Editage

Paperpal

R Discovery

Mind the Graph

R Discovery Prime

R Discovery Prime

Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction.

Abstract

Talk to us

Similar Papers

More From: International Journal of Molecular Sciences