Abstract

RNA 5-hydroxymethylcytosine (5hmC) modification is the basis of the translation of genetic information and the biological evolution. The study of its distribution in transcriptome is fundamentally crucial to reveal the biological significance of 5hmC. Biochemical experiments can use a variety of sequencing-based technologies to achieve high-throughput identification of 5hmC; however, they are labor-intensive, time-consuming, as well as expensive. Therefore, it is urgent to develop more effective and feasible computational methods. In this paper, a novel and powerful model called iR5hmcSC is designed for identifying 5hmC. Firstly, we extract the different features by K-mer, Pseudo Structure Status Composition and One-Hot encoding. Subsequently, the combination of chi-square test and logistic regression is utilized as the feature selection method to select the optimal feature sets. And then stacking learning, an ensemble learning method including random forest (RF), extra trees (EX), AdaBoost (Ada), gradient boosting decision tree (GBDT), and support vector machine (SVM), is used to recognize 5hmC and non-5hmC. Finally, 10-fold cross-validation test is performed to evaluate the model. The accuracy reaches 85.27% and 79.92% on benchmark dataset and independent dataset, respectively. The result is better than the state-of-the-art methods, which indicates that our model is a feasible tool to identify 5hmC. The datasets and source code are freely available at https://github.com/HongyanShi026/iR5hmcSC.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.