Abstract

A new super-secondary structure dataset is built for prediction. It contains 2805 non-redundant protein chains with sequence identity < 25% and resolution better than 2.0 Å, and its structures can be classified into four motifs. Matrix scoring values and the hydropathy distribution are extracted from the protein sequences and then input to the random forest algorithm to predict the super-secondary structure in proteins. The overall predictive accuracy is 72.71% by fivefold cross-validation and 68.54% on the independent dataset test. The proposed method is also tested on a previous independent test dataset, where the overall predictive accuracy is 84.89%, which is better than the performance of the previous predictions.
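The evaluation protocol named above (fivefold cross-validation scored by overall accuracy) can be illustrated with a minimal sketch. This is not the authors' code: the function names (`fivefold_indices`, `overall_accuracy`, `cross_validate`, `majority_baseline`) are hypothetical, the shuffled round-robin split is an assumption about how the folds might be formed, and the majority-class baseline is only a stand-in for the random forest classifier.

```python
import random
from collections import Counter


def fivefold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def cross_validate(features, labels, fit_predict, n_folds=5, seed=0):
    """Hold out each fold once, train on the rest, pool all predictions."""
    folds = fivefold_indices(len(labels), n_folds, seed)
    y_true, y_pred = [], []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        preds = fit_predict(
            [features[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [features[i] for i in test_idx],
        )
        y_true.extend(labels[i] for i in test_idx)
        y_pred.extend(preds)
    return overall_accuracy(y_true, y_pred)


def majority_baseline(train_x, train_y, test_x):
    """Placeholder classifier (assumption): always predict the most common training label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return [most_common] * len(test_x)
```

In practice, `fit_predict` would wrap a random forest implementation trained on the matrix scoring and hydropathy features; the baseline here only shows how per-fold predictions are pooled into a single overall accuracy.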
