A two-step ensemble learning for predicting protein hot spot residues from whole protein sequence.

Sijie Yao, Peng Chen, Chunhou Zheng, Bing Wang

doi:10.1007/s00726-022-03129-5

Abstract

Protein hot spot residues are functional sites in protein-protein interactions. They have traditionally been identified by biological experiments, which are laborious and time-consuming, so a variety of computational methods have been developed in recent years. Despite their success in hot spot identification, most of these methods are impractical because they can recognize hot spot residues only among residues already known to lie on a protein-protein interface. Identifying hot spots from the whole protein sequence is therefore a meaningful and interesting problem. However, it introduces an extreme imbalance between positive and negative samples: hot spot residues account for only about 1-2% of whole protein sequences. To address this issue, this paper proposes a two-step ensemble model for identifying hot spot residues from an extremely unbalanced data set. The model is composed of 134 classifiers built from KNN and SVM base learners. Compared with previous methods, our model performs well, with an F1 score of 0.593 on the BID test set. Furthermore, to validate its robustness, the model was tested on three other independent test sets and also achieved good predictions. More importantly, the performance of our model on the unbalanced data set is comparable with that of other methods tested on balanced hot spot data sets.
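To illustrate why an F1 score, rather than accuracy, is the natural metric at a 1-2% positive rate, the following minimal sketch compares the two on a made-up labelling. The sequence length, the number of hot spots, and both predictors are hypothetical values chosen for illustration; they are not taken from the BID data or from the model described here.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 = hot spot."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of residues labelled correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical sequence set: 1000 residues, 15 of them hot spots (1.5%).
y_true = [1] * 15 + [0] * 985

# Predictor A (hypothetical): labels every residue as a non-hot spot.
all_negative = [0] * 1000

# Predictor B (hypothetical): finds 9 of the 15 hot spots, with 6 false alarms.
partial = [1] * 9 + [0] * 6 + [1] * 6 + [0] * 979

for name, pred in [("all-negative", all_negative), ("partial", partial)]:
    print(name, "accuracy:", accuracy(y_true, pred), "F1:", f1_score(y_true, pred))
```

In this sketch the all-negative predictor scores high on accuracy while finding no hot spots at all, so its F1 is zero. The F1 score rewards only predictors that actually recover the rare positive class.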

Full Text