Abstract

DNA 5-hydroxymethylcytosine (5hmC), N6-methyladenine (6mA) and N4-methylcytosine (4mC) are three common DNA modifications that are involved in a variety of biological processes. Accurate genome-wide identification of 5hmC, 6mA and 4mC sites is invaluable for better understanding their biological functions. Because experimental methods for genome-wide detection of 5hmC, 6mA and 4mC are labor-intensive and expensive, computational methods for this task are urgently needed. With this in mind, the current study set out to construct a machine learning-based method to identify 5hmC, 6mA and 4mC sites in multiple species. We first encoded positive and negative samples using three schemes: K-tuple nucleotide frequency components, nucleotide chemical properties combined with nucleotide frequency, and mono-nucleotide binary encoding. A Random Forest classifier was then used to identify 5hmC, 6mA and 4mC sites. Results of five-fold cross-validation and independent dataset tests showed that the proposed method generalizes well, suggesting that it is effective at identifying 5hmC, 6mA and 4mC sites. To facilitate the identification of 5hmC, 6mA and 4mC sites, a web server called iDNA-MS was established for the proposed method, which is freely accessible at http://lin-group.cn/server/iDNA-MS.
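
The three sequence encodings named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the default k = 2, the (ring structure, functional group, hydrogen bonding) triples assigned to each nucleotide, and the prefix-based definition of nucleotide frequency are assumptions taken from conventions commonly used in the sequence-encoding literature, and the abstract does not specify any of them. The sketch also assumes sequences contain only A, C, G and T.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

# Assumed chemical-property triples (ring structure, functional group,
# hydrogen bonding); a common convention, not confirmed by the abstract.
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (0, 0, 1)}


def kmer_frequency(seq, k=2):
    """K-tuple nucleotide frequency: normalized count of every possible k-mer."""
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[m] / total for m in kmers]


def ncp_with_frequency(seq):
    """Per position: the 3 chemical-property bits plus the frequency of that
    nucleotide within the prefix ending at the position."""
    features = []
    seen = dict.fromkeys(NUCLEOTIDES, 0)
    for position, nt in enumerate(seq, start=1):
        seen[nt] += 1
        features.extend(NCP[nt])
        features.append(seen[nt] / position)
    return features


def binary_encoding(seq):
    """Mono-nucleotide binary (one-hot) encoding, 4 bits per position."""
    features = []
    for nt in seq:
        features.extend(1 if nt == ref else 0 for ref in NUCLEOTIDES)
    return features


def encode(seq, k=2):
    """Concatenate the three encodings into one feature vector."""
    seq = seq.upper()
    if set(seq) - set(NUCLEOTIDES):
        raise ValueError("sequence must contain only A, C, G, T")
    return kmer_frequency(seq, k) + ncp_with_frequency(seq) + binary_encoding(seq)
```

For fixed-length sequence windows, vectors produced by `encode` all have the same length and could be passed to a classifier such as a Random Forest; whether the original method concatenates the three encodings or evaluates them separately is not stated in the abstract.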
