Abstract

DNA methylation plays a significant role in transcriptional regulation by repressing transcriptional activity. Changes in DNA methylation level are an important factor affecting the expression of target genes and downstream phenotypes. Because current experimental technologies can assay only a small proportion of the CpG sites in the human genome, reliable computational models for predicting genome-wide DNA methylation are urgently needed. Here, we propose a novel algorithm that extracts sequence complexity features (seven features) and develop a support-vector-machine-based prediction model that integrates them with previously reported DNA composition features (trinucleotide frequency and GC content, 65 features), trained on the methylation profiles of human embryonic stem cells. Predictions on 22 human chromosomes with windows of varying size showed that the 600-bp window achieved the best average accuracy, 94.7%. Comparisons with two existing methods further showed the superiority of our model, and cross-species predictions on mouse data demonstrated that the model generalizes to some degree. Finally, a statistical test comparing the experimental data and the predicted data on functional regions annotated by ChromHMM found that six out of 10 regions were consistent, which implies reliable prediction of unassayed CpG sites. Accordingly, we believe that our model will be useful and reliable for predicting DNA methylation.

Highlights

  • Although the DNA sequence of the human genome, which carries genetic information, is almost invariant across human cells, the epigenetic features of each cell differ greatly, leading to distinguishable gene expression patterns and cell-type specificities [1]

  • The average methylation levels in specific regions are consistent with other signals and modifications that are related to transcriptional regulation, such as transcription factor binding sites (TFBSs), DNase I hypersensitive sites (DHSs) and various histone modifications [1,9]

  • Our prediction method has the following advantages compared with current classifiers: (a) a group of novel features called "sequence complexity" was developed, and subsequent analysis confirmed that these new features played a significant role in prediction; (b) by integrating the fundamental features (DNA composition), the prediction model achieved satisfactory results; (c) all the features we used were extracted only from the primary DNA sequence of the human genome, without additional experiments, and comparisons with previous works showed the superiority of our method; and (d) a statistical test comparing the experimental data and the predicted data on functional regions annotated by ChromHMM found that six out of 10 regions were consistent, which implies reliable prediction of unassayed CpG sites
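
The DNA composition features named in the abstract (trinucleotide frequency and GC content, 65 features) can be illustrated with a short sketch: 4^3 = 64 trinucleotide frequencies plus one GC-content value. This is not the authors' implementation. The function names, the use of overlapping 3-mers, the skipping of non-ACGT bases, and the centring of the window on the CpG position are all assumptions made for illustration; only the feature count and the 600-bp window size come from the text.

```python
from itertools import product

# All 64 trinucleotides over the DNA alphabet, in a fixed order.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def composition_features(window: str) -> list[float]:
    """Return 64 trinucleotide frequencies followed by GC content (65 values).

    Assumption: frequencies are counted over overlapping 3-mers, and 3-mers
    containing a non-ACGT character (e.g. N) are skipped.
    """
    seq = window.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    freqs = [counts[k] / total if total else 0.0 for k in TRINUCLEOTIDES]

    # GC content over the unambiguous bases only.
    n_acgt = sum(seq.count(base) for base in "ACGT")
    gc = (seq.count("G") + seq.count("C")) / n_acgt if n_acgt else 0.0
    return freqs + [gc]


def window_around(chrom_seq: str, cpg_pos: int, size: int = 600) -> str:
    """Hypothetical helper: up to `size` bp centred on a 0-based CpG position.

    The window is truncated at the ends of the sequence.
    """
    half = size // 2
    start = max(0, cpg_pos - half)
    return chrom_seq[start:cpg_pos + half]
```

A feature vector for one CpG site would then be `composition_features(window_around(chrom_seq, cpg_pos))`. The seven sequence complexity features are not defined in this summary, so they are left out of the sketch.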


Summary

Introduction

Although the DNA sequence of the human genome, which carries genetic information, is almost invariant across human cells, the epigenetic features of each cell differ greatly, leading to distinguishable gene expression patterns and cell-type specificities [1]. Our prediction method has the following advantages compared with current classifiers: (a) a group of novel features called "sequence complexity" was developed, and subsequent analysis confirmed that these new features played a significant role in prediction; (b) by integrating the fundamental features (DNA composition), the prediction model achieved satisfactory results; (c) all the features we used were extracted only from the primary DNA sequence of the human genome, without additional experiments, and comparisons with previous works showed the superiority of our method; and (d) a statistical test comparing the experimental data and the predicted data on functional regions annotated by ChromHMM found that six out of 10 regions were consistent, which implies reliable prediction of unassayed CpG sites. The study is organized into the following sections:

DNA Methylation Dataset and Data Preprocessing
DNA Methylation Pattern
Comparison with Different Classifiers
Feature Importance
Top 24 Important Features
Comparison with Other Existing Methods
Cross-Species Prediction
Prediction of DNA Methylation Profiles across the Whole Human Genome
Discussion
Predicting window size
Features for Prediction
Support Vector Machine
Evaluating Indicator