Abstract
Pseudouridine (Ψ) is the most prevalent RNA modification and is formed from uridine through an isomerization reaction. With the increasing availability of genomic and proteomic samples, computer-aided pseudouridine-synthase-specific Ψ site recognition is becoming possible. In this paper, we propose an ensemble approach to identify pseudouridine sites, named EnsemPseU. First, five sequence-encoding strategies, namely, kmer, binary encoding, enhanced nucleic acid composition (ENAC), nucleotide chemical property (NCP), and nucleotide density (ND), were applied to extract sequence information. Then, chi-square feature selection was used to reduce the feature dimensionality and remove redundant information. Finally, an ensemble algorithm integrating support vector machine (SVM), extreme gradient boosting (XGBoost), naive Bayes (NB), k-nearest neighbor (KNN), and random forest (RF) was used to build our prediction model. Upon testing, the results showed that the accuracy improved by 5.3% for H. sapiens, 6.09% for S. cerevisiae, and 5.55% for M. musculus after chi-square feature selection. Moreover, in both 10-fold cross-validation and an independent test, our proposed model EnsemPseU outperformed the best existing model. The source code and data sets are available at https://github.com/biyue1026/EnsemPseU.
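To make two of the encodings named above concrete, the following is a minimal sketch of NCP and ND encoding for a short RNA sequence. It is not taken from the EnsemPseU repository. The function name `ncp_nd_encode` is hypothetical, and the three-bit NCP mapping (ring structure, functional group, hydrogen bonding) is the one commonly used in the literature, assumed here because the abstract does not list the values. ND is taken to be the running frequency of the current nucleotide up to and including its position.

```python
# Assumed NCP mapping: (ring structure, functional group, hydrogen bonding).
# The abstract names NCP but does not give these values.
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "U": (0, 0, 1),
}


def ncp_nd_encode(seq):
    """Return a flat list with 3 NCP bits plus 1 density value per position."""
    counts = {}
    features = []
    for position, nucleotide in enumerate(seq.upper(), start=1):
        # Running count of this nucleotide in the prefix seq[:position].
        counts[nucleotide] = counts.get(nucleotide, 0) + 1
        density = counts[nucleotide] / position
        features.extend(NCP[nucleotide])
        features.append(density)
    return features


# Hypothetical 4-nt sequence, used only for illustration.
example = ncp_nd_encode("AUGA")
```

Each position contributes four values, so a sequence of length n yields a vector of length 4n. Vectors like this would then be concatenated with the other encodings before feature selection.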
Highlights
Pseudouridine (Ψ) is the most prevalent RNA modification, which is formed from uridine through an isomerization reaction
To develop a more effective model for identifying Ψ sites, we propose an ensemble model called EnsemPseU that integrates support vector machine (SVM), extreme gradient boosting (XGBoost), naïve Bayes (NB), k-nearest neighbor (KNN), and random forest (RF) based on a majority voting strategy
To measure the performance of our model, we used four metrics: sensitivity (Sn), specificity (Sp), accuracy (Acc), and the Matthews correlation coefficient (MCC), which have been used in a series of studies to evaluate the effectiveness of predictors [33]–[35]
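The majority voting and the four metrics mentioned in the highlights can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names `majority_vote` and `evaluate` are hypothetical, the per-classifier predictions are made-up toy labels rather than results from the paper, and the metrics use the standard confusion-matrix definitions, which the highlights do not spell out.

```python
import math


def majority_vote(predictions):
    """Combine per-classifier 0/1 label lists into one label list.

    A sample is labelled 1 when more than half of the classifiers vote 1.
    With five classifiers there are no ties.
    """
    n_models = len(predictions)
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*predictions)]


def evaluate(y_true, y_pred):
    """Compute Sn, Sp, Acc, and MCC from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": tp / (tp + fn) if tp + fn else 0.0,
        "Sp": tn / (tn + fp) if tn + fp else 0.0,
        "Acc": (tp + tn) / len(pairs),
        # MCC is conventionally reported as 0 when the denominator is 0.
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Toy labels for six samples (illustrative only, not data from the paper).
y_true = [1, 1, 1, 0, 0, 0]
base_predictions = [
    [1, 1, 0, 0, 0, 1],  # stands in for SVM
    [1, 1, 0, 0, 1, 0],  # stands in for XGBoost
    [1, 0, 1, 0, 0, 1],  # stands in for NB
    [1, 1, 0, 1, 0, 0],  # stands in for KNN
    [1, 0, 0, 0, 0, 1],  # stands in for RF
]
y_pred = majority_vote(base_predictions)
scores = evaluate(y_true, y_pred)
```

Using an odd number of base classifiers keeps hard voting free of ties, which is one practical reason a five-member ensemble is convenient for binary classification.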
Summary
Pseudouridine (Ψ) is the most prevalent RNA modification, which is formed from uridine through an isomerization reaction. In 2015, Li et al. built the first computational model, called PPUS, to predict pseudouridine-synthase-specific Ψ sites in H. sapiens and S. cerevisiae [7]. They used the nucleotides around Ψ as features and employed SVM as the classifier. The following year, Chen et al. developed another model, called iRNA-PseU, to identify Ψ sites in H. sapiens, S. cerevisiae, and M. musculus, again employing SVM as the classifier [8]. As feature vectors, they incorporated the occurrence-frequency density distributions of the nucleotides and their chemical properties into the general form of pseudo k-tuple nucleotide composition (PseKNC).