Abstract

DNA N6-methyladenine (6mA) is a common epigenetic modification that plays significant roles in the growth and development of plants. Identifying 6mA sites is crucial for elucidating the functions of 6mA. In this article, a novel model named i6mA-vote is developed to predict 6mA sites in plants. Firstly, DNA sequences were encoded into six feature vectors using diverse strategies based on the density, physicochemical properties, and position of nucleotides. To find the best encoding strategy, the feature vectors were compared across several machine learning classifiers. The results suggested that nucleotide position has a significant positive effect on 6mA site identification. Thus, the dinucleotide one-hot strategy, which describes the positional characteristics of nucleotides well, was employed to extract DNA features in our method. Secondly, the DNA sequences of Rosaceae were randomly divided into a training dataset and a test dataset. Finally, i6mA-vote was constructed by combining five different base classifiers under a majority voting strategy and trained on the Rosaceae training dataset. i6mA-vote was evaluated on the task of predicting 6mA sites in the genomes of Rosaceae, Rice, and Arabidopsis separately. On Rosaceae, i6mA-vote achieved an accuracy (ACC) of 0.955, a Matthews correlation coefficient (MCC) of 0.909, a sensitivity (SN) of 0.955, and a specificity (SP) of 0.954. The same indicators, in the order ACC, MCC, SN, SP, were 0.882, 0.774, 0.961, and 0.803 on Rice and 0.798, 0.617, 0.666, and 0.929 on Arabidopsis. According to these indicators, our method is effective and outperforms the other methods considered. The results also show that i6mA-vote performs well not only in intraspecies but also in interspecies 6mA site prediction in plants. Moreover, the specificity is distinctly lower than the sensitivity on Rice, while the opposite holds on Arabidopsis. This may result from sequence similarity among Rosaceae, Rice, and Arabidopsis.
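
The four indicators reported above can be computed from confusion-matrix counts. The following is a minimal sketch using the standard definitions of ACC, MCC, SN, and SP. The function name `binary_metrics` and the counts are hypothetical: the abstract does not give the test set sizes, so the example assumes 1000 positive and 1000 negative samples and picks counts that reproduce the SN and SP reported for Rice.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Return ACC, MCC, SN and SP from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity: recall on true 6mA sites
    sp = tn / (tn + fp)                      # specificity: recall on non-6mA sites
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall fraction of correct predictions
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "MCC": mcc, "SN": sn, "SP": sp}

# Hypothetical counts: 1000 positives and 1000 negatives (an assumption,
# not stated in the abstract), with SN = 0.961 and SP = 0.803 as reported for Rice.
rice = binary_metrics(tp=961, fn=39, tn=803, fp=197)
```

Under the equal-class-size assumption, the sketch yields an ACC of 0.882 and an MCC of about 0.774, which is consistent with the values reported for Rice. With balanced classes, ACC is simply the mean of SN and SP, so a large gap between SN and SP can hide behind a single accuracy figure.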

Highlights

  • DNA N6-methyladenine (6mA) is a methyl modification at the sixth position of the adenine ring, which was discovered by Vanyushin et al (1968). 6mA is widely found in prokaryotes and eukaryotes (Fu et al, 2015; Greer et al, 2015; Zhang et al, 2015)

  • DNA sequences were represented by pseudo-k-tuple nucleotide composition incorporating the physicochemical properties of nucleotides, and the sequences were classified by a support vector machine (SVM)

  • DNA sequences were encoded by nucleotide position-based feature descriptors, and these sequences were classified by an ensemble classifier integrating random forest, linear discriminant analysis, multi-layer perceptron, stochastic gradient descent, and extreme gradient boosting

Summary

INTRODUCTION

DNA N6-methyladenine (6mA) is a methyl modification at the sixth position of the adenine ring, which was discovered by Vanyushin et al (1968). 6mA is widely found in prokaryotes and eukaryotes (Fu et al, 2015; Greer et al, 2015; Zhang et al, 2015). iDNA6mA-PseKNC (Feng et al, 2019) was proposed to detect 6mA sites in the mouse genome. In this model, DNA sequences were represented by pseudo-k-tuple nucleotide composition incorporating the physicochemical properties of nucleotides, and the sequences were classified by a support vector machine (SVM). Although Metai6mA has achieved encouraging results in intraspecies prediction, it still has room for improvement in interspecies prediction. To solve this problem, a novel classification model, i6mA-vote, was developed based on an ensemble learning strategy. In this model, DNA sequences were encoded by nucleotide position-based feature descriptors, and these sequences were classified by an ensemble classifier integrating random forest, linear discriminant analysis, multi-layer perceptron, stochastic gradient descent, and extreme gradient boosting. The details of i6mA-vote will be introduced
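
The way an ensemble of five base classifiers can be combined into a single prediction can be illustrated with a small sketch. It assumes unweighted hard voting on class labels, which is one common reading of a majority voting strategy. The source does not state whether votes are weighted, and the function name `majority_vote` and the per-classifier predictions below are hypothetical stand-ins for the outputs of trained models.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    predictions: list of equal-length lists, one list per base classifier.
    """
    if not predictions:
        raise ValueError("need at least one classifier's predictions")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must predict the same samples")
    voted = []
    for sample_votes in zip(*predictions):
        # most_common(1) returns the label with the highest vote count;
        # with five voters and two classes a tie cannot occur.
        voted.append(Counter(sample_votes).most_common(1)[0][0])
    return voted

# Hypothetical outputs of the five base classifiers for four sequences
# (1 = 6mA site, 0 = non-6mA site).
base_predictions = [
    [1, 0, 1, 0],  # random forest
    [1, 0, 0, 0],  # linear discriminant analysis
    [1, 1, 1, 0],  # multi-layer perceptron
    [0, 0, 1, 1],  # stochastic gradient descent
    [1, 0, 0, 0],  # extreme gradient boosting
]
final_labels = majority_vote(base_predictions)
```

An odd number of voters on a binary task guarantees a strict majority for every sample, so no tie-breaking rule is needed in this setting.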

MATERIALS AND METHODS
A DNA sequence is usually composed of four standard nucleotide symbols
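
A position-preserving encoding of such a sequence can be sketched as a dinucleotide one-hot vector. This is only an illustration under stated assumptions: overlapping dinucleotides taken with a step of one, the 16 dinucleotides ordered alphabetically, and an all-zero block for any dinucleotide containing a non-standard symbol. None of these details is confirmed by the text here, and the function name `dinucleotide_one_hot` is hypothetical.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# Assumed ordering: the 16 dinucleotides in alphabetical order (AA, AC, ..., TT).
DINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]
INDEX = {dn: i for i, dn in enumerate(DINUCLEOTIDES)}

def dinucleotide_one_hot(seq):
    """Encode a DNA sequence of length L as a flat 0/1 vector of length 16 * (L - 1)."""
    seq = seq.upper()
    features = []
    for i in range(len(seq) - 1):
        block = [0] * 16
        idx = INDEX.get(seq[i:i + 2])
        if idx is not None:  # assumption: non-standard symbols (e.g. N) give an all-zero block
            block[idx] = 1
        features.extend(block)
    return features

# "GATC" contains the overlapping dinucleotides GA, AT and TC.
vector = dinucleotide_one_hot("GATC")
```

Because each dinucleotide occupies its own 16-element block, the position of every nucleotide pair is preserved in the feature vector, unlike composition-based encodings that only count occurrences.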
RESULTS AND DISCUSSION
DATA AVAILABILITY STATEMENT