Identification of DNA N4-methylcytosine sites based on multi-source features and gradient boosting decision tree

Shengli Zhang, Yingying Yao, Jiesheng Wang, Yunyun Liang

doi:10.1016/j.ab.2022.114746

Abstract

N4-methylcytosine (4mC) is an important and common DNA methylation that is widespread in prokaryotes. It plays a crucial role in correcting DNA replication errors and in protecting host DNA against degradation by restriction enzymes. Accurate identification of 4mC sites is therefore important for understanding biological functions and treating genetic diseases. In this paper, we design a novel model for identifying 4mC sites. First, we extract features from the original sequences using multi-source feature representation methods: mono-nucleotide binary and k-mer frequency, dinucleotide binary and position-specific frequency, ring-function-hydrogen-chemical properties, dinucleotide-based DNA properties, and trinucleotide-based DNA properties. Next, a gradient boosting decision tree is applied to select the optimal feature set and remove redundant information. Finally, a support vector machine is employed to classify sites as 4mC or non-4mC. The accuracies on six datasets reach 0.851, 0.859, 0.801, 0.87, 0.859 and 0.901, respectively, which are superior to those of previous prediction methods. These results show that our predictor is a feasible and effective tool for identifying 4mC sites. Furthermore, an online web server is available at http://dnan4c.zhanglab.site.
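
As an illustration of the first feature group named above, the following minimal Python sketch shows one common way to compute a mono-nucleotide binary (one-hot) encoding and a k-mer frequency vector for a DNA sequence. The function names `mono_binary` and `kmer_frequency`, the A, C, G, T bit order, the lexicographic k-mer order, and the normalisation by the number of windows are all assumptions made for this example; they are not taken from the paper and may differ from the definitions the authors used.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # assumed alphabet and bit order (hypothetical choice)


def mono_binary(seq):
    """One-hot encode each base as 4 bits in A, C, G, T order.

    Bases outside the alphabet (e.g. N) become four zeros.
    """
    features = []
    for base in seq.upper():
        features.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return features


def kmer_frequency(seq, k=2):
    """Relative frequency of every possible k-mer, in lexicographic order.

    Frequencies are normalised by the number of sliding windows,
    which is an assumption of this sketch.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    if total <= 0:
        return [0.0] * len(kmers)
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing unknown bases
            counts[window] += 1
    return [counts[m] / total for m in kmers]


if __name__ == "__main__":
    example = "ACGTAC"  # toy sequence, not from any benchmark dataset
    print(mono_binary(example))
    print(kmer_frequency(example, k=2))
```

In a pipeline of the kind the abstract describes, vectors such as these would be concatenated with the other feature groups before feature selection and classification; those later steps are not shown here.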

Full Text