Abstract
Protein Secondary Structural Class (PSSC) information is important in investigating further challenges of protein sequences like protein fold recognition, protein tertiary structure prediction, and analysis of protein functions for drug discovery. Identification of PSSC using biological methods is time-consuming and cost-intensive. Several computational models have been developed to predict the structural class; however, they lack in generalization of the model. Hence, predicting PSSC based on protein sequences is still proving to be an uphill task. In this article, we proposed an effective, novel and generalized prediction model consisting of a feature modeling and an ensemble of classifiers. The proposed feature modeling extracts discriminating information (features) by leveraging three techniques: (i) Embedding - features are extracted on the basis of spatial residue arrangements of the sequences using word embedding approaches; (ii) SkipXGram Bi-gram - various sets of skipped bi-gram features are extracted from the sequences; and (iii) General Statistical (GS) based features are extracted which covers the global information of structural sequences. The combined effective sets of features are trained and classified using an ensemble of three classifiers: Support Vector Machine (SVM), Random Forest (RF), and Gradient Boosting Machines (GBM). The proposed model when assessed on five benchmark datasets (high and low sequence similarity), viz. z277, z498, 25PDB, 1189, and FC699, reported an overall accuracy of 93.55, 97.58, 81.82, 81.11, and 93.93 percent respectively. The proposed model is further validated on a large-scale updated low similarity ( ≤ 25%) dataset, where it achieved an overall accuracy of 81.11 percent. The proposed generalized model is robust and consistently outperformed several state-of-the-art models on all the five benchmark datasets.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: IEEE/ACM Transactions on Computational Biology and Bioinformatics
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.