Protein solubility is an important property of a protein and can influence, and in some cases inhibit, the physiological and biochemical processes of cancer cells in the human body. Understanding protein solubility may therefore help to reveal the mechanisms of solubility-related diseases. Existing protein solubility prediction methods have difficulty extracting rich feature information from protein sequences. To address this limitation and improve prediction performance, this paper proposes EL-FFsol, a protein solubility prediction model based on the CatBoost ensemble learning framework and the fusion of multiple protein sequence features. First, four groups of sequence features were combined into a fused representation: Physicochemical Properties, One-hot Feature Encoding, Amino Acid Composition and Statistical Features. Then, CatBoost was used to construct an ensemble learning model that predicts protein solubility from this representation. Finally, EL-FFsol was evaluated on the benchmark dataset. It achieved an accuracy of 0.7679, a Matthews correlation coefficient of 0.5480, a sensitivity of 0.6630, a specificity of 0.8729, an area under the ROC curve of 0.8540 and an area under the P-R curve of 0.8440. Compared with DeepSOL and DDcCNN, the Matthews correlation coefficient increased by 1.68% and 0.79%, the area under the ROC curve increased by 1.60% and 2.20%, and the area under the P-R curve increased by 1.70% and 2.40%, respectively.
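As a rough illustration of the feature fusion and the threshold-based metrics named above, the following minimal sketch concatenates a few sequence descriptors into one vector and computes accuracy, sensitivity, specificity and the Matthews correlation coefficient from confusion-matrix counts. It is not the EL-FFsol implementation: the function names, the `max_len` padding scheme, and the use of the charged-residue fraction and sequence length as stand-ins for the physicochemical and statistical features are all assumptions made for illustration. In practice such a fused vector could be passed to a gradient-boosting classifier such as CatBoost.

```python
from collections import Counter
from math import sqrt

# The 20 standard amino acids; any other symbol is ignored in this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def amino_acid_composition(seq):
    """Fraction of each standard amino acid in the sequence (20 values)."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[aa] / n if n else 0.0 for aa in AMINO_ACIDS]


def one_hot(seq, max_len):
    """Flattened one-hot encoding, truncated or zero-padded to max_len residues."""
    vec = [0.0] * (max_len * len(AMINO_ACIDS))
    for pos, aa in enumerate(seq[:max_len]):
        idx = INDEX.get(aa)
        if idx is not None:
            vec[pos * len(AMINO_ACIDS) + idx] = 1.0
    return vec


def fuse_features(seq, max_len):
    """Concatenate the feature groups into one vector.

    Assumption: the charged-residue fraction (D, E, K, R) stands in for the
    physicochemical features and the sequence length for the statistical ones.
    """
    seq = seq.upper()
    charged = sum(seq.count(aa) for aa in "DEKR") / len(seq) if seq else 0.0
    return (amino_acid_composition(seq)
            + one_hot(seq, max_len)
            + [charged, float(len(seq))])


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


if __name__ == "__main__":
    features = fuse_features("MKTAYIAKQR", max_len=16)
    print(len(features))  # 20 (composition) + 16 * 20 (one-hot) + 2 extras
    print(binary_metrics(tp=40, tn=45, fp=5, fn=10))
```

Concatenation is the simplest form of fusion; it leaves the classifier to weigh the feature groups, which suits tree ensembles because they do not require the groups to share a scale.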