EnsemPseU: Identifying Pseudouridine Sites With an Ensemble Approach

Yue Bi,Cangzhi Jia,Dong Jin

doi:10.1109/access.2020.2989469


Open Access

https://doi.org/10.1109/access.2020.2989469


Journal: IEEE Access
Publication Date: Jan 1, 2020
Citations: 16
License type: CC BY 4.0

Affiliation: Dalian Maritime University

Abstract

Pseudouridine (Ψ), the most prevalent RNA modification, is formed from uridine through an isomerization reaction. With the increasing availability of genomic and proteomic samples, computer-aided recognition of pseudouridine-synthase-specific Ψ sites is becoming possible. In this paper, we propose an ensemble approach to identify pseudouridine sites, named EnsemPseU. First, five sequence-encoding strategies, namely kmer, binary encoding, enhanced nucleic acid composition (ENAC), nucleotide chemical property (NCP), and nucleotide density (ND), were applied to extract sequence information. Then, chi-square feature selection was used to reduce the feature dimensionality and remove redundant information. Finally, an ensemble algorithm integrating support vector machine (SVM), extreme gradient boosting (XGBoost), naive Bayes (NB), k-nearest neighbor (KNN), and random forest (RF) was used to build our prediction model. After chi-square feature selection, the accuracy improved by 5.3% for H. sapiens, 6.09% for S. cerevisiae, and 5.55% for M. musculus. Moreover, in 10-fold cross-validation and an independent test, EnsemPseU outperformed the best existing model. The source code and data sets are available at https://github.com/biyue1026/EnsemPseU.
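
As a rough illustration of the chi-square selection step named in the abstract, the sketch below scores non-negative features against binary class labels and keeps the highest-scoring ones. The formulation (per-class feature sums compared with the sums expected under independence), the function names `chi2_scores` and `select_top_k`, and the toy data are assumptions made for illustration; the abstract does not state the exact statistic or how many features were retained.

```python
from typing import List, Sequence


def chi2_scores(X: Sequence[Sequence[float]], y: Sequence[int]) -> List[float]:
    """Chi-square score of each non-negative feature column against class labels."""
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    # Fraction of samples that belong to each class.
    class_prob = {c: sum(1 for label in y if label == c) / n_samples for c in classes}
    scores = []
    for j in range(n_features):
        total = sum(row[j] for row in X)  # total mass of feature j over all samples
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob[c] * total  # expected sum if feature and class were independent
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring features, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data (hypothetical): feature 0 tracks the label, feature 1 does not.
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
scores = chi2_scores(X, y)
kept = select_top_k(scores, k=1)
```

In this toy case the informative feature receives the higher score and is the one retained. A real pipeline would apply the same ranking to the concatenated encoding vectors and choose the cut-off by cross-validation.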

Highlights

Pseudouridine (Ψ) is the most prevalent RNA modification, which is formed from uridine through an isomerization reaction
To develop a more effective model for identifying Ψ sites, we propose an ensemble model called EnsemPseU that integrates support vector machine (SVM), extreme gradient boosting (XGBoost), naïve Bayes (NB), k-nearest neighbor (KNN), and random forest (RF) based on a majority voting strategy
To measure the performance of our model, we used four metrics, sensitivity (Sn), specificity (Sp), accuracy (Acc), and the Matthews correlation coefficient (MCC), which have been used in a series of studies to evaluate the effectiveness of predictors [33]–[35]
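
The majority voting strategy and the four metrics named in the highlights can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-classifier predictions are hypothetical hard labels (1 = Ψ site, 0 = non-Ψ site), the helper names `majority_vote` and `evaluate` are invented here, and the metric formulas are the standard confusion-matrix definitions.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-classifier label lists into one label per sample."""
    combined = []
    for votes in zip(*predictions):  # all votes cast for a single sample
        # With an odd number of binary voters there are no ties.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sn, Sp, Acc and MCC computed from the binary confusion matrix."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": tp / (tp + fn) if tp + fn else 0.0,
        "Sp": tn / (tn + fp) if tn + fp else 0.0,
        "Acc": (tp + tn) / len(y_true),
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical hard predictions from five base classifiers on six samples.
y_true = [1, 1, 1, 0, 0, 0]
base_predictions = [
    [1, 1, 0, 0, 0, 1],  # SVM
    [1, 1, 1, 0, 1, 0],  # XGBoost
    [1, 0, 0, 0, 0, 1],  # NB
    [1, 1, 1, 1, 0, 0],  # KNN
    [0, 1, 0, 0, 0, 1],  # RF
]
combined = majority_vote(base_predictions)
metrics = evaluate(y_true, combined)
```

Using five voters keeps the binary vote free of ties; an even number of base classifiers would need an explicit tie-breaking rule.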

Summary

INTRODUCTION

Pseudouridine (Ψ) is the most prevalent RNA modification, which is formed from uridine through an isomerization reaction. In 2015, Li et al. built the first computational model, called PPUS, to predict pseudouridine-synthase-specific Ψ sites in H. sapiens and S. cerevisiae [7]. They used the nucleotides around Ψ as features and employed SVM as the classifier. The following year, Chen et al. developed another model, called iRNA-PseU, to identify Ψ sites in H. sapiens, S. cerevisiae, and M. musculus, again with SVM as the classifier [8]. As feature vectors, they incorporated the occurrence frequency density distributions of the nucleotides and their chemical properties into the general form of pseudo k-tuple nucleotide composition (PseKNC).
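
To make the two feature types mentioned above more concrete, the sketch below encodes an RNA sequence by nucleotide chemical property and by nucleotide density. The three-bit property table (ring structure, functional group, hydrogen-bond strength) and the cumulative density definition are the forms commonly used in the RNA-modification literature and are assumed here, since this summary does not spell them out. The function names `ncp_encode` and `nd_encode` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed property bits per base: (purine ring, amino group, weak hydrogen bonding).
NCP_TABLE: Dict[str, Tuple[int, int, int]] = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "U": (0, 0, 1),
}


def ncp_encode(seq: str) -> List[int]:
    """Concatenate the three chemical-property bits of every nucleotide."""
    return [bit for base in seq for bit in NCP_TABLE[base]]


def nd_encode(seq: str) -> List[float]:
    """Density at position i: occurrences of that base in seq[:i] divided by i (1-based)."""
    counts: Dict[str, int] = {}
    densities = []
    for i, base in enumerate(seq, start=1):
        counts[base] = counts.get(base, 0) + 1
        densities.append(counts[base] / i)
    return densities


example = "AUGA"  # hypothetical fragment, not taken from the benchmark data
ncp_vector = ncp_encode(example)
nd_vector = nd_encode(example)
```

For the fragment `AUGA`, the second `A` gets density 2/4 because two of the first four bases are `A`. The two encodings can be interleaved or concatenated per position to form one feature vector.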

MATERIALS AND METHODS

FEATURE EXTRACTION

RESULTS AND DISCUSSION

THE RESULTS OF FEATURE SELECTION

CONCLUSION
