Abstract

DNA N6-adenine methylation (6mA) is an epigenetic modification in prokaryotes and eukaryotes. Identifying 6mA sites in the rice genome is important in rice epigenetics and breeding, but the non-random distribution and biological functions of these sites remain unclear. Several machine-learning tools can identify 6mA sites but show limited prediction accuracy, which limits their usability in epigenetic research. Here, we developed a novel computational predictor, called the Sequence-based DNA N6-methyladenine predictor (SDM6A), which is a two-layer ensemble approach for identifying 6mA sites in the rice genome. Unlike existing methods, which are based on single models with basic features, SDM6A explores various features, of which five encoding methods were identified as appropriate for this problem. Subsequently, an optimal feature set was identified from each encoding, and corresponding models were developed individually using support vector machine (SVM) and extremely randomized tree (ERT) classifiers. First, all five single models were integrated via an ensemble approach to define the class for each classifier. Second, the two classifiers were integrated to generate a final prediction. SDM6A achieved robust performance on cross-validation and independent evaluation, with average accuracy and Matthews correlation coefficient (MCC) of 88.2% and 0.764, respectively. Corresponding metrics were 4.7%–11.0% and 2.3%–5.5% higher than those of existing methods, respectively. A user-friendly, publicly accessible web server (http://thegleelab.org/SDM6A) was implemented to predict novel putative 6mA sites in the rice genome.
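The two metrics reported in the abstract, accuracy and MCC, are both computed from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the function names and the example counts are hypothetical and chosen only for illustration, not taken from the SDM6A evaluation.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero (undefined denominator),
    which is a common convention and an assumption of this sketch.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for 100 candidate sites (50 true 6mA, 50 non-6mA).
example_counts = dict(tp=45, tn=43, fp=7, fn=5)
example_accuracy = accuracy(**example_counts)
example_mcc = mcc(**example_counts)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix in both numerator and denominator, so it stays informative when the positive and negative classes differ in size.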

Highlights

  • Recent breakthroughs in the fields of molecular biology and genomics have made it possible to determine the functional significance of DNA modifications

  • We evaluated the performance of five different feature encodings using four different machine learning (ML) classifiers

  • The improved performance of SDM6A may be explained as follows: (1) because previous feature extraction methods were relatively simple, we systematically and comprehensively explored different types of feature encodings and determined that five feature encodings significantly contribute to prediction of 6mA sites; (2) we optimized each feature encoding and individually integrated them via an ensemble strategy for support vector machine (SVM) and extremely randomized tree (ERT); and (3) we developed an ensemble model by integrating SVM and ERT, which further improved robustness of the model
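The two-layer ensemble described in the highlights can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text does not state how scores are combined, so plain averaging in both layers and a 0.5 decision threshold are assumptions, and all function names and probability values are hypothetical.

```python
from statistics import mean


def layer_one(encoding_probs):
    """Layer 1: combine the per-encoding probabilities of one classifier.

    Assumption: a plain mean over the five encoding-specific models.
    """
    if len(encoding_probs) != 5:
        raise ValueError("expected one probability per feature encoding (5)")
    return mean(encoding_probs)


def layer_two(svm_score, ert_score, threshold=0.5):
    """Layer 2: combine the SVM-level and ERT-level scores.

    Assumption: a plain mean, with a site called 6mA at or above threshold.
    """
    final_score = mean([svm_score, ert_score])
    return final_score, final_score >= threshold


def predict_site(svm_probs, ert_probs, threshold=0.5):
    """Run both layers for one candidate adenine site."""
    return layer_two(layer_one(svm_probs), layer_one(ert_probs), threshold)


# Hypothetical per-encoding probabilities for a single candidate site.
svm_probs = [0.9, 0.8, 0.7, 0.6, 0.5]
ert_probs = [0.6, 0.5, 0.4, 0.3, 0.2]
final_score, is_6ma = predict_site(svm_probs, ert_probs)
```

In this example the SVM-level score is 0.7 and the ERT-level score is 0.4, so the site is called positive only because the second layer averages the two. A weighted combination or a tuned threshold would be equally plausible design choices.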


Summary

Introduction

Recent breakthroughs in the fields of molecular biology and genomics have made it possible to determine the functional significance of DNA modifications. 6mA sites have not been extensively investigated because of their non-uniform distribution across the genome. The distribution and function of 6mA modifications have been studied in unicellular eukaryotes; until recently, the nature of these alterations in multicellular eukaryotes was unclear.[3] Several new studies have shed light on the distribution and contrasting regulatory functions of 6mA modifications in multicellular eukaryotes, such as Caenorhabditis elegans, Danio rerio, Drosophila melanogaster, Mus musculus, Tetrahymena, and Xenopus laevis.[4,5,6,7,8,9,10]

