Abstract

As a widespread RNA post-transcriptional modification (PTCM), 5-methylcytosine (m5C) plays a vital role in basic biological mechanisms and in the treatment of major diseases. Traditional high-throughput experimental approaches for finding m5C sites are usually expensive and laborious. Given the large number of RNA sequences to be screened, accurate computational methods that distinguish m5C from non-m5C sites offer an efficient alternative. Here we introduce a novel predictor, called iRNA-m5C_NB, that identifies m5C sites in Homo sapiens using the Naive Bayes (NB) algorithm. In this method, the unbalanced dataset Met935 is first processed with the hybrid-sampling strategy SMOTEENN. The top 57 features are then selected by ANOVA F-value from four well-performing feature extraction techniques: bi-profile Bayes (BPB), enhanced nucleic acid composition (ENAC), electron-ion interaction pseudopotentials (EIIP) and mMGap_1. In the jackknife test, recall on the unbalanced training dataset Met935 reaches 82.81% with an MCC of 0.63. On the independent dataset Test1157, the predictor still achieves a recall of 70.06% and an MCC of 0.34. It is the first m5C predictor constructed using the unbalanced dataset, and its recall is 19.82% and 59.23% higher than that of the latest tool, RNAm5CPred, in the jackknife and independent tests, respectively. These results indicate that iRNA-m5C_NB outperforms other state-of-the-art models, and we hope it will serve as an efficient and reliable method for identifying m5C sites.
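To make one of the four feature families concrete, the following is a minimal sketch of ENAC as it is commonly defined: nucleotide frequencies computed in a fixed-length window that slides along the sequence. The window size of 5, the `ACGU` ordering, the function name `enac`, and the 11-nt example fragment are all assumptions for illustration; the abstract does not state the parameters the authors used.

```python
from collections import Counter

ALPHABET = "ACGU"  # assumed feature order; not specified in the abstract


def enac(sequence, window=5):
    """Return nucleotide frequencies for each sliding window, flattened.

    For every window of `window` nucleotides, four values are emitted
    (the fraction of A, C, G and U in that window).
    """
    seq = sequence.upper().replace("T", "U")  # accept DNA-style input
    if len(seq) < window:
        raise ValueError("sequence is shorter than the window")
    features = []
    for start in range(len(seq) - window + 1):
        counts = Counter(seq[start:start + window])
        features.extend(counts[base] / window for base in ALPHABET)
    return features


# Hypothetical 11-nt fragment with the candidate cytosine at the centre (index 5).
example = "GGACUCAGCUA"
vector = enac(example)
print(len(vector))  # 7 windows x 4 bases
print(vector[:4])   # frequencies of A, C, G, U in the first window "GGACU"
```

A vector like this would be concatenated with the other encodings before any feature selection step; how the authors combine them is not described here.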

Highlights

  • m5C is formed through catalysis by RNA methyltransferases, which attach a methyl group to the carbon atom at the 5th position of the cytosine (C) ring [5]

  • Dou et al.: iRNA-m5C_NB: Novel Predictor to Identify RNA 5-Methylcytosine Sites Based on the Naive Bayes (NB) Classifier

  • We focused on the identification of RNA m5C sites in H. sapiens using the unbalanced datasets Met935 and Test1157

Summary

Introduction

iRNAm5C-PseDNC, built on the unbalanced dataset Met1900 (475 positive and 1425 negative samples), achieved an accuracy of 92.37% and an MCC of 0.79, but that dataset contains a large number of redundant sequences. High accuracies (more than 93%) have been reported on the balanced dataset under the jackknife test; however, because m5C sites are unevenly distributed, there is an urgent need to construct a high-performance model from unbalanced data.
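The reason accuracy alone is a weak guide on unbalanced data can be shown with a short worked example. The sketch below uses the Met1900 class sizes quoted above and a hypothetical classifier that labels every site as non-m5C; the helper names and the convention of returning 0 when the MCC denominator is zero are assumptions made for illustration.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fn):
    """Fraction of true m5C sites that were recovered."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when undefined (assumed convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


POS, NEG = 475, 1425  # Met1900 class sizes from the text

# Hypothetical baseline: predict "non-m5C" for every sample.
tp, fn, tn, fp = 0, POS, NEG, 0
baseline = (accuracy(tp, tn, fp, fn), recall(tp, fn), mcc(tp, tn, fp, fn))
print(baseline)
```

This baseline scores 75% accuracy while finding no m5C sites at all, which is why recall and MCC are the more informative measures when the classes are unbalanced.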
