Abstract

DNA N4-methylcytosine (4mC) plays an important role in numerous biological functions and is an epigenetic mechanism of particular importance, so accurate identification of 4mC sites in DNA sequences is necessary to understand its functional mechanisms. Although some effective computational tools have been proposed for identifying DNA 4mC sites, improving identification accuracy and generalization ability remains challenging. This study therefore proposes XGB4mcPred, a novel predictor of 4mC sites trained using the extreme gradient boosting algorithm (XGBoost) and DNA sequence information. First, we applied One-Hot encoding to adjacent and spaced nucleotides, dinucleotides, and trinucleotides of the original 4mC site sequences to obtain feature vectors. Then, feature importance values obtained from a pre-trained XGBoost model were used as a threshold to filter out redundant features, which significantly improved the identification accuracy of XGB4mcPred. Our analysis shows that 4mC and non-4mC site sequences have clearly different nucleotide preferences in six datasets from multiple species, and that the optimized features better distinguish 4mC sites from non-4mC sites. Cross-validation and independent tests on six different species show that XGB4mcPred significantly outperformed other state-of-the-art predictors, with improvements of varying degrees. Additionally, a user-friendly web server for XGB4mcPred has been made freely accessible.
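The two steps described in the abstract, One-Hot encoding of adjacent and spaced k-mers followed by importance-based filtering, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the gap convention (`gap` counts the nucleotides skipped between letters), the `max_k` and `max_gap` defaults, and the strict `>` comparison against the threshold are all assumptions made for illustration. The importance scores are taken as given; in the paper they would come from a pre-trained XGBoost model, which is not reproduced here.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_vocab(k):
    """All k-mers over ACGT in lexicographic order (4**k entries)."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def one_hot_spaced(seq, k=1, gap=0):
    """One-hot encode every k-mer whose letters lie gap + 1 positions apart.

    gap=0 gives adjacent k-mers; gap=1 skips one nucleotide between letters.
    Each window contributes 4**k binary entries, exactly one of which is 1.
    """
    vocab = kmer_vocab(k)
    index = {kmer: i for i, kmer in enumerate(vocab)}
    step = gap + 1
    span = (k - 1) * step + 1  # sequence length covered by one spaced k-mer
    features = []
    for start in range(len(seq) - span + 1):
        kmer = seq[start:start + span:step]
        block = [0] * len(vocab)
        block[index[kmer]] = 1  # raises KeyError for symbols outside ACGT
        features.extend(block)
    return features


def encode(seq, max_k=3, max_gap=1):
    """Concatenate encodings for k = 1..max_k and gaps 0..max_gap (assumed layout)."""
    out = []
    for k in range(1, max_k + 1):
        # A gap is meaningless for single nucleotides, so k=1 uses gap=0 only.
        gaps = [0] if k == 1 else range(max_gap + 1)
        for gap in gaps:
            out.extend(one_hot_spaced(seq, k=k, gap=gap))
    return out


def keep_indices(importances, threshold):
    """Indices of features whose importance score exceeds the threshold."""
    return [i for i, w in enumerate(importances) if w > threshold]


def select(row, indices):
    """Project one encoded feature vector onto the retained feature indices."""
    return [row[i] for i in indices]
```

In use, every sequence in a dataset would be passed through `encode`, a model would be fitted on the full feature set to obtain one importance score per column, and `keep_indices` followed by `select` would produce the reduced vectors used to train the final predictor.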

Highlights

  • This study proposed a novel predictor named XGB4mcPred, based on One-Hot encoding of genomic sequence information

  • One-Hot encoding of multiple types of sequences is proposed in this study; an evaluation measure based on true positives, false positives, true negatives, and false negatives is often seen as a measure of balance


Introduction

DNA methylation is the process of adding methyl groups to specific regions of DNA. It leads to changes in gene expression [1] and can regulate gene expression and silencing without altering the nucleotide sequence. Gene activity can thus be regulated by controlling DNA methylation, either turning off the activity of certain genes or inducing the reactivation and expression of others [2,3,4,5]. DNA methylation is therefore closely relevant to the study of cancer and aging, and to the regulation of virulence and antibiotic resistance in prokaryotes [6,7,8]. 5-methylcytosine (5mC), N6-methyladenine (6mA), and N4-methylcytosine (4mC) are the most common types of DNA methylation and are widely found in eukaryotes and prokaryotes [9,10,11,12].
