XGB5hmC: Identifier based on XGB model for RNA 5-hydroxymethylcytosine detection

Agung Surya Wibowo, Hilal Tayara, Kil To Chong

doi:10.1016/j.chemolab.2023.104847

Abstract

One bioinformatics problem that artificial intelligence can address is the detection of RNA 5-hydroxymethylcytosine (5hmC) sites, an approach that has become increasingly important because it saves labor, materials, and time. A reliable identifier requires the highest possible predictive performance. In this study, we developed XGB5hmC, a high-performance identifier of RNA 5hmC. Extreme gradient boosting (XGB) was used as the best-performing model; we also investigated other models, namely random forest (RF), AdaBoost (AB), and gradient boosting (GB). First, iLearnPlus was used to run 15 different machine learning models with 35 different descriptors in order to select the best descriptors. The composition of k-spaced nucleic acid pairs (CKSNAP), pseudo-K-tuple nucleotide composition (PseKNC), and position-specific trinucleotide propensity based on single strand (PSTNPss) were identified as the best descriptors. Subsequently, the features from these descriptors were combined, and their dimensionality was reduced using chi-squared test filtering. Using these filtered features and the XGB model, we obtained better performance than the state-of-the-art methods: accuracy, sensitivity, specificity, and Matthews correlation coefficient (MCC) increased by 11.43%, 15.82%, 8.94%, and 24.58%, respectively. These results suggest that our model is a more reliable identifier of 5hmC sites. All datasets and the complete source code are freely available at https://github.com/asw1982/XGB5hmC.
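To make the CKSNAP descriptor named above concrete, the following is a minimal sketch of how such a feature vector can be computed for a single RNA sequence. It is not taken from the XGB5hmC repository or from iLearnPlus: the function name `cksnap`, the default `max_gap=2`, the conversion of T to U, and the choice to normalize by the number of valid pairs are all assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGU"  # assumption: RNA alphabet; T is mapped to U below


def cksnap(seq, max_gap=2):
    """Return k-spaced nucleotide pair frequencies for gaps 0..max_gap.

    For each gap g, the 16 possible pairs (x, y) are counted where y lies
    g + 1 positions after x, then divided by the number of valid pairs.
    The result has 16 * (max_gap + 1) entries.
    """
    seq = seq.upper().replace("T", "U")
    pairs = ["".join(p) for p in product(ALPHABET, repeat=2)]
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(pairs, 0)
        total = 0
        for i in range(len(seq) - gap - 1):
            pair = seq[i] + seq[i + gap + 1]
            if pair in counts:  # skip pairs with ambiguous symbols such as N
                counts[pair] += 1
                total += 1
        # Frequencies for this gap; all zeros if the sequence is too short
        features.extend(counts[p] / total if total else 0.0 for p in pairs)
    return features


if __name__ == "__main__":
    vector = cksnap("ACGUACGGCUA")
    print(len(vector))  # 16 pairs x 3 gap values
```

In a pipeline like the one described, vectors of this kind would be concatenated with the PseKNC and PSTNPss features before chi-squared filtering; the gap range used in the study itself is not stated in the abstract.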

Full Text