Abstract

Heat shock proteins (HSPs) are ubiquitous in living organisms. HSPs are essential for cell growth and survival; their main function is to control the folding and unfolding of proteins. According to molecular function and mass, HSPs are categorized into six families: HSP20 (small HSPs), HSP40 (J-proteins), HSP60, HSP70, HSP90, and HSP100. In this paper, an improved method for HSP prediction is proposed: the split amino acid composition (SAAC), the dipeptide composition (DC), the conjoint triad feature (CTF), and the pseudoaverage chemical shift (PseACS) were selected as features to predict HSPs with a support vector machine (SVM). To overcome the imbalanced-data classification problem, the synthetic minority oversampling technique (SMOTE) was used to balance the dataset. With the optimized feature combination SAAC+DC+CTF+PseACS, the overall accuracy in the jackknife test was 99.72% on the balanced dataset, 4.81% higher than on the imbalanced dataset with the same feature combination. The sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthews correlation coefficient (MCC) for the HSP families in our predictive model were higher than those of existing methods. This improved method may be helpful for protein function prediction.
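The four evaluation measures named in the abstract (Sn, Sp, Acc, MCC) are not defined in this excerpt. The sketch below uses their standard one-vs-rest definitions from confusion counts; the function name `binary_metrics` and the zero-division fallbacks are illustrative assumptions, not taken from the paper.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return (Sn, Sp, Acc, MCC) for one class treated as positive.

    tp/fn/tn/fp are confusion counts for a single HSP family versus
    all other families. Standard definitions are assumed here.
    """
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)       # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a marginal is zero; 0.0 is an assumed fallback.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc


# Hypothetical counts, for illustration only (not results from the paper).
sn, sp, acc, mcc = binary_metrics(tp=90, fn=10, tn=80, fp=20)
```

MCC is often reported alongside accuracy for imbalanced data because it uses all four confusion counts, so a classifier that favors the majority class does not score well on it.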

Highlights

  • Heat shock proteins (HSPs) are ubiquitous in living organisms

  • The results indicate that the combined feature set split amino acid composition (SAAC) + dipeptide composition (DC) + conjoint triad feature (CTF) + pseudoaverage chemical shift (PseACS), together with the synthetic minority oversampling technique (SMOTE), helped enhance predictive performance

  • An optimized classifier for HSP family identification was developed. The model is based on the support vector machine (SVM) algorithm, and SMOTE was used to address the imbalanced-data classification problem
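To make the oversampling idea concrete, here is a minimal pure-Python sketch of the core SMOTE step: each synthetic sample is placed at a random point on the segment between a minority-class sample and one of its nearest minority-class neighbours. The function name `smote`, the parameter defaults, and the random choice of base samples are simplifying assumptions; the paper's actual settings are not given in this excerpt.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolation.

    minority: list of feature vectors (lists of floats) from one class.
    k: number of nearest neighbours to choose from (assumed default).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [minority[j] for j in range(len(minority)) if j != i]
        others.sort(key=lambda v: sum((a - b) ** 2 for a, b in zip(x, v)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy 2-D minority class, for illustration only.
toy = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
extra = smote(toy, n_new=6, k=2)
```

In practice a maintained implementation, such as the `SMOTE` class in the imbalanced-learn package, would normally be used instead of a hand-written one.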

Summary

Introduction

Computational and Mathematical Methods in Medicine

Heat shock proteins (HSPs) are ubiquitous in living organisms. They act as molecular chaperones by facilitating and maintaining proper protein structure and function [1, 2, 3, 4]; in addition, they are involved in various cellular processes such as protein assembly, secretion, transportation, and degradation [5, 6].

Feng et al. developed a predictor called “iHSP-RAAAC” that used the reduced amino acid alphabet (RAAA) as a feature vector; the overall predictive accuracy was 87.42% with the jackknife test [21]. Ahmad et al. used the split amino acid composition (SAAC), the dipeptide composition (DC), and the pseudo amino acid composition (PseAAC) [22, 23] to identify HSPs; the highest overall predictive accuracy was 90.7% with the jackknife test [24].

In this work, SAAC, DC, the conjoint triad feature (CTF), and the pseudoaverage chemical shift (PseACS) were used to predict HSPs with the same datasets as investigated by Feng et al. Data imbalance is a common problem in developing efficient and reliable prediction systems: with an imbalanced dataset, the classifier tends to favor the majority class. With the optimized feature combination SAAC+DC+CTF+PseACS, the overall accuracy in the jackknife test was 99.72% on the balanced dataset, 4.81% higher than on the imbalanced dataset with the same feature combination.
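Two of the sequence features named above, DC and SAAC, can be sketched in a few lines. This is an illustrative version under stated assumptions: the function names are hypothetical, residues outside the 20 standard amino acids are ignored, and the 25-residue N- and C-terminal segment lengths for SAAC are an assumed default, since the excerpt does not state the lengths used.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def amino_acid_composition(seq):
    """20-dimensional vector of standard-residue frequencies."""
    total = sum(seq.count(a) for a in AMINO_ACIDS)
    return [seq.count(a) / total if total else 0.0 for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """400-dimensional vector of adjacent-residue-pair frequencies."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


def split_amino_acid_composition(seq, n_term=25, c_term=25):
    """60-dimensional vector: composition of N-terminal, middle, C-terminal parts.

    Segment lengths are assumptions. For sequences shorter than
    n_term + c_term, the terminal segments overlap and the middle is empty.
    """
    head = seq[:n_term]
    middle = seq[n_term:len(seq) - c_term]
    tail = seq[-c_term:]
    return (amino_acid_composition(head)
            + amino_acid_composition(middle)
            + amino_acid_composition(tail))


# Hypothetical toy sequence; a combined vector is a simple concatenation.
toy_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
features = split_amino_acid_composition(toy_seq) + dipeptide_composition(toy_seq)
```

Concatenating the per-feature vectors, as in the last line, is one common way to build a combined feature such as SAAC+DC before passing it to a classifier.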
