Abstract

Pseudouridine (Ψ) is the most prevalent RNA modification and is formed from uridine through an isomerization reaction. With the increasing availability of genomic and proteomic samples, computer-aided recognition of pseudouridine-synthase-specific Ψ sites is becoming possible. In this paper, we propose an ensemble approach for identifying pseudouridine sites, named EnsemPseU. First, five sequence-encoding strategies, namely kmer, binary encoding, enhanced nucleic acid composition (ENAC), nucleotide chemical property (NCP), and nucleotide density (ND), were applied to extract sequence information. Then, chi-square feature selection was used to reduce the feature dimensionality and remove redundant information. Finally, an ensemble algorithm integrating support vector machine (SVM), extreme gradient boosting (XGBoost), naive Bayes (NB), k-nearest neighbor (KNN), and random forest (RF) was used to build our prediction model. After chi-square feature selection, the accuracy improved by 5.3% for H. sapiens, 6.09% for S. cerevisiae, and 5.55% for M. musculus. Moreover, when evaluated via 10-fold cross-validation and an independent test, our proposed model EnsemPseU outperformed the best existing model. The source code and data sets are available at https://github.com/biyue1026/EnsemPseU.

Highlights

  • Pseudouridine (Ψ) is the most prevalent RNA modification, which is formed from uridine through an isomerization reaction

  • To develop a more effective model for identifying Ψ sites, we propose an ensemble model called EnsemPseU that integrates support vector machine (SVM), extreme gradient boosting (XGBoost), naïve Bayes (NB), k-nearest neighbor (KNN), and random forest (RF) based on a majority voting strategy

  • To measure the performance of our model, we used four metrics: sensitivity (Sn), specificity (Sp), accuracy (Acc), and the Matthews correlation coefficient (MCC), which have been used in a series of studies to evaluate the effectiveness of predictors [33]–[35]
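
The majority voting strategy and the four metrics named in the highlights can be illustrated with a minimal sketch. It is not the EnsemPseU implementation: the function names `majority_vote` and `evaluate` are hypothetical, the base classifiers are represented only by lists of 0/1 predictions they are assumed to have already produced, and the metric formulas are the standard confusion-matrix definitions.

```python
from collections import Counter
from math import sqrt


def majority_vote(predictions):
    """Combine per-classifier 0/1 predictions by majority vote.

    `predictions` is a list with one inner list per classifier; all inner
    lists must have the same length (one label per sample). With an odd
    number of binary classifiers, such as five, no ties can occur.
    """
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(clf_preds[i] for clf_preds in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted


def evaluate(y_true, y_pred):
    """Return Sn, Sp, Acc, and MCC for binary labels (1 = positive site)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Each ratio falls back to 0.0 when its denominator is zero.
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    acc = (tp + tn) / len(pairs) if pairs else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


if __name__ == "__main__":
    # Made-up predictions from five hypothetical classifiers on four samples.
    y_true = [1, 1, 0, 0]
    clf_outputs = [
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 1, 0],
        [0, 1, 0, 0],
        [1, 0, 0, 1],
    ]
    print(evaluate(y_true, majority_vote(clf_outputs)))
```

In this toy input every individual classifier except the first makes at least one error, yet the voted labels match `y_true`, which is the intuition behind combining heterogeneous classifiers.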

Summary

INTRODUCTION

Pseudouridine (Ψ) is the most prevalent RNA modification and is formed from uridine through an isomerization reaction. In 2015, Li et al. built the first computational model, called PPUS, to predict pseudouridine-synthase-specific Ψ sites in H. sapiens and S. cerevisiae [7]. They used the nucleotides around Ψ as features and employed SVM as the classifier. The following year, Chen et al. developed another model, called iRNA-PseU, to identify Ψ sites in H. sapiens, S. cerevisiae, and M. musculus, again employing SVM as the classifier [8]. As feature vectors, they incorporated the occurrence frequency density distributions of the nucleotides and their chemical properties into the general form of pseudo k-tuple nucleotide composition (PseKNC).
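
To make the idea of combining nucleotide chemical properties with occurrence density concrete, the following is a minimal sketch rather than the encoding code of iRNA-PseU or EnsemPseU. It assumes the commonly used three-bit property convention (ring structure, functional group, hydrogen bonding) and defines the density at position i as the fraction of the first i nucleotides that equal the nucleotide at position i. The function name `encode_ncp_nd` is hypothetical.

```python
# Assumed three-bit chemical property code per nucleotide:
# (ring structure: purine = 1, functional group: amino = 1,
#  hydrogen bonding: weak = 1).
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "U": (0, 0, 1),
}


def encode_ncp_nd(seq):
    """Encode an RNA sequence as 4 numbers per position: 3 NCP bits + density.

    DNA-style input is accepted by mapping T to U. Any other character
    raises KeyError, since no property code is defined for it here.
    """
    seq = seq.upper().replace("T", "U")
    counts = {}
    features = []
    for i, nt in enumerate(seq, start=1):
        counts[nt] = counts.get(nt, 0) + 1
        density = counts[nt] / i  # share of this nucleotide in seq[:i]
        features.extend([*NCP[nt], density])
    return features


if __name__ == "__main__":
    # A 3-nucleotide toy sequence yields a 12-dimensional vector.
    print(encode_ncp_nd("AUA"))
```

Because every position contributes the same number of values, fixed-length sequence windows centered on a candidate uridine map to fixed-length vectors, which is what classifiers such as SVM require.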

MATERIALS AND METHODS
FEATURE EXTRACTION
RESULTS AND DISCUSSION
THE RESULTS OF FEATURE SELECTION
CONCLUSION
