Abstract
As one of the most widespread RNA post-transcriptional modifications (PTCMs), 5-methylcytosine (m5C) plays a vital role in understanding basic biological mechanisms and in treating major diseases. Traditional high-throughput experimental approaches to finding m5C sites are usually expensive and laborious. Given the large number of RNA sequences available, accurate computational methods that distinguish m5C from non-m5C sites offer an efficient alternative. Here we introduce a novel predictor, iRNA-m5C_NB, that identifies m5C sites in Homo sapiens using the Naive Bayes (NB) algorithm. In this method, the unbalanced dataset Met935 is first processed with the hybrid-sampling strategy SMOTEENN. The top 57 features are then selected by ANOVA F-value from four well-performing feature extraction techniques: Bi-profile Bayes (BPB), enhanced Nucleic Acid Composition (ENAC), electron-ion interaction pseudopotentials (EIIP) and mMGap_1. Under the jackknife test, recall on the unbalanced training dataset Met935 reaches 82.81% with an MCC of 0.63. On the independent dataset Test1157, the predictor still achieves a recall of 70.06% and an MCC of 0.34. It is the first m5C predictor constructed using an unbalanced dataset, and its recall scores are 19.82% and 59.23% higher than those of the latest tool, RNAm5CPred, on the jackknife and independent tests, respectively. These results indicate that iRNA-m5C_NB outperforms other state-of-the-art models, and we hope it will serve as an efficient and reliable method for identifying m5C sites.
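To make two of the feature encodings named above concrete, here is a minimal Python sketch of EIIP and ENAC as they are commonly defined. It is an illustration under stated assumptions, not the authors' implementation: the function names, the window size of 5 and the example sequence are hypothetical, and the EIIP values are the commonly cited nucleotide values, with U given the value usually listed for T.

```python
# Commonly cited EIIP values per nucleotide (assumption: U takes the T value).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "U": 0.1335}


def normalize(seq):
    """Upper-case the sequence and write it in the RNA alphabet (T -> U)."""
    return seq.upper().replace("T", "U")


def eiip_encode(seq):
    """Replace each nucleotide with its EIIP value, one feature per position."""
    return [EIIP[base] for base in normalize(seq)]


def enac_encode(seq, window=5):
    """Slide a fixed window along the sequence and record, for every window,
    the frequency of A, C, G and U inside it (window size is an assumption)."""
    seq = normalize(seq)
    features = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        features.extend(chunk.count(base) / window for base in "ACGU")
    return features


# Hypothetical 7-nt fragment; real samples are longer windows centred on a cytosine.
example = "ACGUACG"
eiip_features = eiip_encode(example)   # 7 values, one per nucleotide
enac_features = enac_encode(example)   # 3 windows x 4 frequencies = 12 values
```

Both encoders return plain lists of floats, so their outputs can be concatenated into one feature vector before any feature selection step.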
Highlights
m5C is formed through catalysis by RNA methyltransferase, which attaches a methyl group to the carbon atom at the 5th position of the cytosine (C) ring [5]
Dou et al.: iRNA-m5C_NB: Novel Predictor to Identify RNA 5-Methylcytosine Sites Based on the Naive Bayes (NB) Classifier
We focused on the identification of RNA m5C sites in H. sapiens using the unbalanced datasets Met935 and Test1157
Summary
iRNAm5C-PseDNC was built on the unbalanced dataset Met1900 (475 positive and 1425 negative samples), which contains a large number of redundant sequences, and achieved an accuracy of 92.37% and an MCC of 0.79. High accuracies (more than 93%) have also been reported on balanced datasets under the jackknife test. Because m5C sites are unevenly distributed in practice, there is an urgent need for a high-performance model built on unbalanced data.
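A short worked example may help show why accuracy alone is a weak yardstick on unbalanced data and why recall and MCC are reported alongside it. The sketch below uses the Met1900 class counts quoted above (475 positive, 1425 negative) and a hypothetical baseline that labels every sample negative. The function name and the convention of returning 0 when the MCC denominator is zero are assumptions made for illustration.

```python
import math


def metrics(tp, fn, tn, fp):
    """Return (accuracy, recall, MCC) from confusion-matrix counts.

    MCC is set to 0.0 when its denominator is zero (a common convention,
    adopted here as an assumption).
    """
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, recall, mcc


# Hypothetical baseline: predict "non-m5C" for all 1900 samples.
acc, rec, mcc = metrics(tp=0, fn=475, tn=1425, fp=0)
print(f"accuracy={acc:.2%} recall={rec:.2%} MCC={mcc:.2f}")
```

This baseline reaches 75% accuracy while finding no m5C site at all, so its recall and MCC are both zero. That gap is the reason a model trained on unbalanced data is better judged by recall and MCC than by accuracy.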