Abstract

Protein secondary structure (PSS) describes the local folded structures which get formed inside a polypeptide due to interactions among atoms of the backbone. Generally, globular proteins are divided into four classes, namely all-α, all-β, α + β, andα/β. As nearly 90% of proteins fall into the said four classes, these are mostly considered for the purpose of computational classification of proteins. Classification of PSS is important for different biological functions that include protein fold recognition, tertiary structure prediction, prediction of DNA-binding sites, and reduction of the conformation search space among others. In this paper, we have proposed a machine learning-based model for secondary structure classification of proteins into four classes: all-α, all-β, α + β, andα/β. In doing so, we have considered both sequence-based and structure-based features. At first, mutual information (MI), a filter-based feature selection method, is used to remove the redundant features, and then these selected features are used to train three different classifiers-random forest, K-nearest neighbor (KNN), and multi-layer perceptron (MLP). After that, some standard classifier combination approaches are applied to integrate the decision made by the said classifiers and it has been found that weighted product rule performs the best among all. The overall accuracies obtained using the proposed model on the four standard datasets, namely 640, 1189, 25pdb, and fc699 are 86.89%, 92.93%, 91.38%, and 94.87% respectively. The proposed model outperforms some state-of-the-art methods considered here for comparison. Significantly high classification accuracy produced by our proposed model on four datasets is attributed to the development of a comprehensive feature set (by eliminating redundant features through feature selection technique) which is then passed through an ensemble consists of three different classifiers. Assigning different weights to the outcome of different classifiers thus proved to be useful in designing the model for predicting the secondary structure of proteins based on its sequence-based and structure-based features. Graphical abstract.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.