Abstract

Owing to their ability to maintain a thermodynamically stable fold at extremely high temperatures, thermophilic proteins (TPPs) play a critical role in basic research and a variety of applications in the food industry. As a result, the development of computational models for rapidly and accurately identifying novel TPPs from a large number of uncharacterized protein sequences is desirable. Although computational models have already been developed for characterizing thermophilic proteins, their performance and interpretability remain unsatisfactory. We present a novel sequence-based thermophilic protein predictor, termed SCMTPP, for improving model predictability and interpretability. First, an up-to-date and high-quality dataset consisting of 1853 TPPs and 3233 non-TPPs was compiled from published literature. Second, the SCMTPP predictor was created by combining the scoring card method (SCM) with estimated propensity scores of g-gap dipeptides. Benchmarking experiments revealed that SCMTPP had a cross-validation accuracy of 0.883, which was comparable to that of a support vector machine-based predictor (0.906–0.910) and 2–17% higher than that of commonly used machine learning models. Furthermore, SCMTPP outperformed the state-of-the-art approach (ThermoPred) on the independent test dataset, with accuracy and MCC of 0.865 and 0.731, respectively. Finally, the SCMTPP-derived propensity scores were used to elucidate the critical physicochemical properties for protein thermostability enhancement. In terms of interpretability and generalizability, comparative results showed that SCMTPP was effective for identifying and characterizing TPPs. We have implemented the proposed predictor as a user-friendly online web server at http://pmlabstack.pythonanywhere.com/SCMTPP in order to allow easy access to the model.
SCMTPP is expected to be a powerful tool for facilitating community-wide efforts to identify TPPs on a large scale and guiding experimental characterization of TPPs.

Highlights

  • Owing to their ability to maintain a thermodynamically stable fold at extremely high temperatures, thermophilic proteins (TPPs) play a critical role in basic research and a variety of applications in the food industry

  • … shape, Gibbs free energy change of hydration in native proteins, dipeptide composition, contacts between amino acid residues, number of ion pairs, hydrogen bonds, packing, and aromatic clusters all play an important role in the stability of thermophilic proteins (TPPs)[5,7]

  • In 2011, Lin et al.[20] constructed a more reliable benchmark dataset containing 915 TPPs and 793 non-TPPs. Using this dataset, ThermoPred was developed by means of the support vector machine (SVM) method in conjunction with amino acid composition (AAC) and dipeptide composition (DPC); in their comparative analysis with the model of Gromiha et al.[27], it achieved an improved accuracy (ACC) of 0.933 as evaluated by jackknife cross-validation


Summary

Introduction

Owing to their ability to maintain a thermodynamically stable fold at extremely high temperatures, thermophilic proteins (TPPs) play a critical role in basic research and a variety of applications in the food industry. Zhang and Fan[31] developed the first TPP predictor based on amino acid composition (AAC) descriptors. In 2011, Lin et al.[20] constructed a more reliable benchmark dataset containing 915 TPPs and 793 non-TPPs (called Lin2011). Using this dataset, ThermoPred was developed by means of the SVM method in conjunction with AAC and dipeptide composition (DPC); in their comparative analysis with the model of Gromiha et al.[27], it achieved an improved accuracy (ACC) of 0.933 as evaluated by jackknife cross-validation. Fan et al.[25] introduced a new TPP predictor (called PSSM400_pKa) based on the SVM method and trained on three different feature encodings, namely AAC, acid dissociation constant (pKa) and position-specific scoring matrices (PSSM). These datasets might not have sufficient information necessary for …

Scientific Reports (2021) 11:23782
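
The jackknife (leave-one-out) evaluation and the ACC and MCC metrics mentioned in this summary can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of ThermoPred or any other cited predictor: a nearest-centroid classifier on AAC features stands in for the SVM, all function names are hypothetical, and the toy sequences are made up.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(seq):
    """Amino acid composition: fraction of each of the 20 standard residues."""
    n = sum(seq.count(a) for a in AMINO_ACIDS)
    return [seq.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def fit_centroids(X, y):
    """Stand-in classifier (not an SVM): one mean feature vector per class."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Assign x to the class whose centroid is nearest (squared Euclidean)."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=dist)


def jackknife_predict(X, y):
    """Leave-one-out: each sample is predicted by a model fit on all the others."""
    preds = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        preds.append(predict_centroid(fit_centroids(X_train, y_train), X[i]))
    return preds


def acc_and_mcc(y_true, y_pred, positive=1):
    """Accuracy and Matthews correlation coefficient for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    acc = (tp + tn) / len(pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # 0.0 when undefined
    return acc, mcc


if __name__ == "__main__":
    # Toy, made-up sequences: label 1 = TPP, label 0 = non-TPP.
    seqs = ["KKEEKKEE", "KEKEKEKE", "EEKKEEKK", "AAGGAAGG", "AGAGAGAG", "GGAAGGAA"]
    labels = [1, 1, 1, 0, 0, 0]
    features = [aac(s) for s in seqs]
    predictions = jackknife_predict(features, labels)
    print(acc_and_mcc(labels, predictions))
```

Jackknife evaluation refits the model once per sample, so its cost grows with dataset size. MCC is reported alongside ACC because it stays informative when the two classes are of unequal size.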
