Abstract

Pseudouridine (Ψ) is the most prevalent RNA modification and is formed from uridine through an isomerization reaction. With the increasing availability of genomic and proteomic samples, computer-aided recognition of pseudouridine-synthase-specific Ψ sites is becoming possible. In this paper, we propose an ensemble approach, named EnsemPseU, to identify pseudouridine sites. First, five sequence-encoding strategies, namely kmer, binary encoding, enhanced nucleic acid composition (ENAC), nucleotide chemical property (NCP), and nucleotide density (ND), were applied to extract sequence information. Then, chi-square feature selection was used to reduce the feature dimensionality and remove redundant information. Finally, an ensemble algorithm integrating support vector machine (SVM), extreme gradient boosting (XGBoost), naive Bayes (NB), k-nearest neighbor (KNN), and random forest (RF) was used to build our prediction model. After chi-square feature selection, the accuracy improved by 5.3% for H. sapiens, 6.09% for S. cerevisiae, and 5.55% for M. musculus. Moreover, when evaluated by 10-fold cross-validation and an independent test, our proposed model EnsemPseU outperformed the best existing model. The source code and data sets are available at https://github.com/biyue1026/EnsemPseU.
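As a rough illustration of the first two encodings named in the abstract, the sketch below computes kmer frequencies and a binary (one-hot) encoding for an RNA sequence. The function names, the A/C/G/U bit order, and the normalization by window count are assumptions made for illustration; the authors' implementation in the linked repository may differ.

```python
from itertools import product

NUCLEOTIDES = "ACGU"  # assumed alphabet and ordering


def kmer_frequencies(seq, k=2):
    """Return normalized k-mer frequencies in lexicographic k-mer order."""
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of sliding windows
    if total <= 0:
        return [0.0] * len(kmers)
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:  # windows with non-ACGU symbols are skipped
            counts[window] += 1
    return [counts[m] / total for m in kmers]


def binary_encoding(seq):
    """One-hot encode each nucleotide as 4 bits (A, C, G, U order)."""
    features = []
    for base in seq:
        features.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return features
```

For a sequence of length L, `binary_encoding` yields 4L features, while `kmer_frequencies` yields 4^k features regardless of L.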

Highlights

  • Pseudouridine (Ψ) is the most prevalent RNA modification, which is formed from uridine through an isomerization reaction

  • To develop a more effective model for identifying Ψ sites, we propose an ensemble model called EnsemPseU that integrates support vector machine (SVM), extreme gradient boosting (XGBoost), naïve Bayes (NB), k-nearest neighbor (KNN), and random forest (RF) based on a majority voting strategy

  • To measure the performance of our model, we used four metrics: sensitivity (Sn), specificity (Sp), accuracy (Acc), and the Matthews correlation coefficient (MCC), which have been used in a series of studies to evaluate the effectiveness of predictors [33]–[35]
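The majority voting strategy and the four metrics from the highlights can be sketched as follows. This is a minimal illustration assuming binary labels (1 for a Ψ site, 0 otherwise) and the standard confusion-matrix definitions of Sn, Sp, Acc, and MCC; the helper names `majority_vote` and `evaluate` are hypothetical and not taken from the paper.

```python
import math


def majority_vote(predictions):
    """Combine per-classifier 0/1 label lists into one label per sample.

    `predictions` holds one list per classifier, all of equal length.
    A sample is labeled 1 when more than half of the classifiers vote 1.
    """
    n_models = len(predictions)
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*predictions)]


def evaluate(y_true, y_pred):
    """Compute Sn, Sp, Acc, and MCC from true and predicted 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sn = tp / (tp + fn) if tp + fn else 0.0   # sensitivity
    sp = tn / (tn + fp) if tn + fp else 0.0   # specificity
    acc = (tp + tn) / len(pairs)              # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}
```

With five base classifiers, as in EnsemPseU, a tie cannot occur, so the strict "more than half" rule is sufficient.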

Summary

INTRODUCTION

Pseudouridine (Ψ) is the most prevalent RNA modification, which is formed from uridine through an isomerization reaction. In 2015, Li et al. built the first computational model, called PPUS, to predict pseudouridine-synthase-specific Ψ sites in H. sapiens and S. cerevisiae [7]. They used the nucleotides around Ψ as features and employed SVM as the classifier. The following year, Chen et al. developed another model, called iRNA-PseU, to identify Ψ sites in H. sapiens, S. cerevisiae, and M. musculus, again employing SVM as the classifier [8]. As feature vectors, they incorporated the occurrence frequency density distributions of the nucleotides and their chemical properties into the general form of pseudo k-tuple nucleotide composition (PseKNC).
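The idea of combining nucleotide chemical properties with occurrence density can be illustrated with a short sketch. It assumes the commonly used three-bit chemical-property code (ring structure, functional group, hydrogen bonding) and defines the density at position i as the fraction of the first i nucleotides that match the nucleotide at position i. Both the mapping and the function name are assumptions for illustration, not code from PPUS, iRNA-PseU, or EnsemPseU.

```python
# Assumed chemical-property code: (ring structure, functional group, hydrogen bond)
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "U": (0, 0, 1),
}


def ncp_nd_encoding(seq):
    """Encode each position as (x, y, z, density).

    The density of position i (1-based) is the number of times the
    nucleotide at i has occurred in seq[:i], divided by i.
    """
    features = []
    seen = {}
    for i, base in enumerate(seq, start=1):
        seen[base] = seen.get(base, 0) + 1
        density = seen[base] / i
        # Unknown symbols get an all-zero chemical-property code.
        features.append(NCP.get(base, (0, 0, 0)) + (density,))
    return features
```

For example, in the sequence "AUA" the second A has density 2/3 because two of the first three nucleotides are A.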

MATERIALS AND METHODS
FEATURE EXTRACTION
RESULTS AND DISCUSSION
THE RESULTS OF FEATURE SELECTION
CONCLUSION