Abstract
DNA methylation of CpG islands plays a crucial role in the regulation of gene expression. More than half of all human promoters contain CpG islands with a tissue-specific methylation pattern in differentiated cells. How DNA methyltransferases determine which regions to methylate is still not fully understood. Many hypotheses about which genomic features correlate with the epigenome have not yet been evaluated. Furthermore, many exploratory approaches to measuring DNA methylation are limited to a subset of the genome and thus cannot be employed, e.g., for genome-wide biomarker prediction methods. In this study, we evaluated the correlation of genetic, epigenetic and hypothesis-driven features with DNA methylation of CpG islands. To this end, various binary classifiers were trained and evaluated by cross-validation on a dataset comprising DNA methylation data for 190 CpG islands in HEPG2, HEK293, fibroblasts and leukocytes. We achieved an accuracy of up to 91% with a Matthews correlation coefficient (MCC) of 0.8 using ten-fold cross-validation with ten repetitions. With these models, we extended the existing dataset to the whole genome and thus predicted the methylation landscape for the given cell types. The method used for these predictions was also validated on an external genome-wide dataset from the ENCODE consortium. Our results reveal features correlated with DNA methylation and confirm or disprove various hypotheses about DNA methylation-related features. This study confirms correlations between DNA methylation and histone modifications, DNA structure, DNA sequence, genomic attributes and CpG island properties. The developed software, the predicted datasets and a web service for comparing methylation states of CpG islands are available at http://www.cogsys.cs.uni-tuebingen.de/software/dna-methylation/.
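The abstract reports both accuracy and MCC for the binary classifiers. As a minimal sketch of how these two measures relate to a confusion matrix, the following Python snippet computes them from true and predicted labels. The function names and the label encoding (1 = methylated, 0 = unmethylated) are assumptions made for illustration and are not taken from the study's software.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels.

    Assumed encoding (illustrative only): 1 = methylated, 0 = unmethylated.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in the range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0.0 when a row or column of the
    # confusion matrix is empty and the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 1]
    counts = confusion_counts(y_true, y_pred)
    print(accuracy(*counts), mcc(*counts))
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, so it stays informative when methylated and unmethylated CpG islands are unevenly represented.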
Highlights
DNA methylation of differentiated cells in mammals occurs almost exclusively at the C5 position in cytosine when it is immediately followed by a guanine [1]
Using the complete set of features, we applied the following machine learning algorithms to assess their performance on CpG island methylation prediction: (A) decision trees (J48), (B) naive Bayes, (C) k-nearest neighbor, (D) K* [33], (E) random decision forest, and support vector machines with (F) a Gaussian radial basis function kernel and (G) a linear kernel
Using the most accurate classifier, we evaluated the suitability of all 15 feature classes for predicting DNA methylation of CpG islands
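The classifier comparison above is evaluated with ten-fold cross-validation repeated ten times. A rough sketch of how such repeated splits can be generated is shown below in plain Python. The function name `repeated_kfold`, the fixed seed and the plain (non-stratified) shuffling are illustrative assumptions; the source does not state how the folds were actually drawn.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Hypothetical helper: folds are drawn by plain shuffling, which is an
    assumption and not necessarily the procedure used in the study.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin so that fold sizes
        # differ by at most one sample.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield repeat, fold, train_idx, test_idx


if __name__ == "__main__":
    # 190 matches the number of CpG islands in the dataset described above.
    splits = list(repeated_kfold(190))
    print(len(splits), len(splits[0][2]), len(splits[0][3]))
```

Each classifier would be trained on `train_idx` and scored on `test_idx` for every split, with the scores averaged over all folds and repetitions.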
Summary
DNA methylation of differentiated cells in mammals occurs almost exclusively at the C5 position in cytosine when it is immediately followed by a guanine [1]. We extracted DNA methylation data for CpG islands for four cell types: leukocytes, fibroblasts, HEPG2 and HEK293. Using the most accurate classifier, we evaluated the suitability of all 15 feature classes for predicting DNA methylation of CpG islands.