Abstract

RNA-protein interactions are essential for the regulation of gene expression, cell defense, developmental regulation, and other life activities, so applying machine learning to predict RNA-binding proteins (RBPs) has become a research hotspot in bioinformatics. We propose a new method for predicting RBPs, called RBPro-RF. First, feature vectors are extracted from the protein sequence by fusing composition-transition-distribution (C-T-D), pseudo-amino acid composition (PseAAC), and position-specific scoring matrix-400 (PSSM-400) descriptors. Second, the synthetic minority oversampling technique (SMOTE) and edited nearest neighbor (ENN) are employed to balance the samples. Third, elastic net (EN) is used to eliminate redundant features and retain the features most important for representing RBPs. Finally, the optimal feature vectors are input into a random forest classifier to predict RBPs. Under ten-fold cross-validation on the training set, the accuracy (ACC) and Matthews correlation coefficient (MCC) are 97.43% and 0.933, respectively. In addition, the accuracies on three independent test sets, Human, S. cerevisiae, and A. thaliana, are 95.63%, 88.82%, and 92.35%, respectively, which are superior to those of state-of-the-art prediction methods. In summary, the experimental results show that our method can significantly improve the accuracy of RBP prediction. The source code and all datasets are available at https://github.com/QUST-AIBBDRC/RBPro-RF/.
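
The two evaluation measures quoted above, accuracy (ACC) and the Matthews correlation coefficient (MCC), can both be computed from the four confusion-matrix counts of a binary classifier. The following is a minimal sketch under the standard textbook definitions of those metrics, assuming labels of 1 for RBP and 0 for non-RBP. The function names are illustrative and are not taken from the RBPro-RF repository.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, 1 = RBP, 0 = non-RBP (assumed encoding)."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels for illustration only; these are not data from the paper.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
acc_value = accuracy(*counts)
mcc_value = mcc(*counts)
```

Unlike ACC, MCC uses all four counts symmetrically, so it stays informative when positive and negative classes are imbalanced, which is a common reason for reporting the two together.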
