Information entropy-based differential evolution with extremely randomized trees and LightGBM for protein structural class prediction

Yu Zhang,Yirui Wang,Zhenyu Lei,Shangce Gao,Pengxing Cai

doi:10.1016/j.asoc.2023.110064

Abstract

The discovery of protein tertiary structure is the basis of current genetic engineering, medicinal design, and other biological applications. Protein structural class plays a significant role in the tertiary structure folding and function analysis of protein. However, the growth rate of new amino acid sequence far exceeds the tertiary structure. Existing research methods of confirming protein folding cannot satisfy massive sequences and protein engineering. A high-accuracy prediction result of low-similarity protein dataset is particularly critical to generate the corresponding tertiary structure from the primary structure. In this paper, we construct a novel super-large-scale feature of the primary structure based on secondary structure, evolutionary information, chemical properties, and global descriptors. The diversified and massive features are utilized to predict the protein class based on a novel feature selection algorithm and a gradient boosting decision tree model. To testify the effectiveness and robustness of our proposed method, namely IDEGBM, we choose the 10-fold cross-validation for evaluating four benchmark datasets 25PDB, FC699, D1189 and D640. Experimental results exhibit that our method improves the accuracy in comparison with other state-of-the-art prediction models in terms of both accuracy and efficiency. Furthermore, a representative protein is used to validate that our proposed IDEGBM can be applied to improve the conformation prediction of protein tertiary structure.

Full Text