PrPred: A Predictor to Identify Plant Resistance Proteins by Incorporating k-Spaced Amino Acid (Group) Pairs.

Yansu Wang,Yingjie Guo,Yu Chen,Lei Xu,Pingping Wang,Shan Huang

doi:10.3389/fbioe.2020.645520

Abstract

To infect plants successfully, pathogens adopt various strategies to overcome the plants' physical and chemical barriers and to interfere with the plant immune system. Plants deploy a large number of resistance (R) proteins to detect invading pathogens. R proteins, which are encoded by resistance genes, include cell surface-localized receptors and intracellular receptors. In this study, a new plant R protein predictor called prPred was developed based on a support vector machine (SVM); it can accurately distinguish plant R proteins from other proteins. On an independent test set, the accuracy, precision, sensitivity, specificity, F1-score, MCC, and AUC of prPred were 0.935, 1.000, 0.806, 1.000, 0.893, 0.857, and 0.948, respectively. Moreover, the predictor integrates the HMMscan search tool and Phobius to identify protein domain families and transmembrane protein regions, which allows it to differentiate subclasses of R proteins. prPred is available at https://github.com/Wangys-prog/prPred. The tool requires a valid Python installation and is run from the command line.
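As a reading aid for the figures reported above, the following minimal sketch shows how the threshold-based metrics follow from the four confusion-matrix counts. The function name `binary_metrics` and the example counts are hypothetical: the counts are not reported in this text and were picked only because they round to the published values. A precision and specificity of 1.000 both imply zero false positives. AUC is not included because it depends on the ranking of prediction scores, not on a single confusion matrix.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from confusion-matrix counts.

    No guard against empty classes: the caller must supply counts for
    which every denominator below is non-zero.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall or true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts (an assumption, not taken from the paper).
example = binary_metrics(tp=25, fp=0, tn=61, fn=6)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts the rounded accuracy, sensitivity, F1-score, and MCC are 0.935, 0.806, 0.893, and 0.857. Other count combinations can round to the same figures, so this does not identify the actual composition of the test set.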

Highlights

Plant pathogens can disturb the plant immune system to support their growth and development within plant tissue.
To determine the optimal algorithm and k value, we explored the discrimination power of k-spaced amino acid pairs with k = 3, 5, 7, 9, and 13 using different algorithms, including logistic regression (LR), K-nearest neighbors (KNN), support vector machine (SVM), random forest (RF), decision tree (DT), gradient boosting classifier (GBC), AdaBoost, and extra-tree classifier (ETC) (Supplementary Table 2).
We observed that SVM outperformed the other algorithms in 10-fold cross-validation tests at the same k value.
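The 10-fold cross-validation comparison mentioned above can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' pipeline: `stratified_folds`, `cross_validated_accuracy`, and `fit_nearest_centroid` are hypothetical names, and the nearest-centroid classifier is only a stand-in so the example runs on its own. It is not one of the algorithms compared in the study.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign every sample index to one of n_folds folds, balancing classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def cross_validated_accuracy(features, labels, fit, n_folds=10, seed=0):
    """Mean held-out accuracy; fit(X, y) must return a predict(row) callable."""
    scores = []
    for held_out in stratified_folds(labels, n_folds, seed):
        if not held_out:  # skip folds left empty by very small datasets
            continue
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        predict = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(features[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def fit_nearest_centroid(X, y):
    """Stand-in classifier (an assumption): predict the closest class mean."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] += 1
    centroids = {c: [v / counts[c] for v in acc] for c, acc in sums.items()}

    def predict(row):
        def squared_distance(label):
            return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))
        return min(centroids, key=squared_distance)

    return predict
```

To compare settings, one would call `cross_validated_accuracy` once per (k value, algorithm) pair on the features built for that k, keeping the seed fixed so every candidate sees the same folds.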

Summary

INTRODUCTION

Plant pathogens can disturb the plant immune system to support their growth and development within plant tissue. RGAugury identifies different subclasses of R proteins, including membrane-associated receptors (RLPs or RLKs) and NBS-containing proteins, by integrating the results generated from several computing programs, such as BLAST (Camacho et al., 2009), InterProScan (Zdobnov and Apweiler, 2001), HMMER3 (Eddy, 2011), nCoil (Lupas et al., 1991), and Phobius (Käll et al., 2004). The machine learning-based methods NBSPred and DRPPP detect R proteins with SVMs by considering various numerical representation schemes of protein sequences. DRPPP was built by extracting various features from input protein sequences, and the model achieved 91.11% accuracy for predicting plant R proteins. We developed an accurate computational approach for identifying R proteins using various sequence features. A support vector machine (SVM) and 5-spaced amino acid (group) pairs were chosen to construct classifiers from these sequence features.
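To make the k-spaced amino acid (group) pair features concrete, here is a minimal sketch of one common way to compute them for a single gap k: count every pair of residues separated by exactly k positions and normalize by the number of counted pairs. The function name, the five-group partition of residues, and the choice to normalize by valid pairs are assumptions for illustration. This text does not specify the grouping or whether prPred concatenates several gap values.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed physicochemical grouping (not specified in the text).
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}
RESIDUE_TO_GROUP = {aa: name for name, members in GROUPS.items() for aa in members}


def k_spaced_pair_composition(sequence, k, mapping=None):
    """Return normalized frequencies of symbol pairs separated by k residues.

    With mapping=None the symbols are the 20 amino acids (400 features).
    With a residue-to-group mapping the symbols are group names
    (25 features for five groups).
    """
    symbols = sorted(set(mapping.values())) if mapping else list(AMINO_ACIDS)
    counts = {pair: 0 for pair in product(symbols, repeat=2)}
    total = 0
    for i in range(len(sequence) - k - 1):
        first, second = sequence[i], sequence[i + k + 1]
        if mapping:
            first, second = mapping.get(first), mapping.get(second)
        if (first, second) in counts:  # ignore non-standard residues
            counts[(first, second)] += 1
            total += 1
    return {pair: (n / total if total else 0.0) for pair, n in counts.items()}


# Residue pairs and group pairs for a toy sequence with k = 5.
toy_sequence = "MKTAYIAKQRQISFVKSHFSRQ"
pair_features = k_spaced_pair_composition(toy_sequence, 5)
group_features = k_spaced_pair_composition(toy_sequence, 5, RESIDUE_TO_GROUP)
print(len(pair_features), len(group_features))
```

The group variant trades resolution for a much shorter vector, which can help when the training set is small relative to the 400 residue-pair dimensions.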

MATERIALS AND METHODS

RESULTS

CONCLUSION

DATA AVAILABILITY STATEMENT