Abstract

DNA methylation is a heritable chemical modification of cytosine and one of the most important epigenetic events. Computational prediction of DNA methylation status can speed up genome-wide methylation profiling and help identify the key features that are correlated with various methylation patterns. Here, we develop CpGIMethPred, a set of support vector machine-based models that predict the methylation status of the CpG islands in the human genome under normal conditions. The features for prediction include those that have previously been demonstrated to be effective (CpG island specific attributes, DNA sequence composition patterns, DNA structure patterns, distribution patterns of conserved transcription factor binding sites and conserved elements, and histone methylation status) as well as those that have not been extensively explored but are likely to contribute additional information from a biological point of view (nucleosome positioning propensities, gene functions, and histone acetylation status). Statistical tests are performed to identify the features that are significantly correlated with the methylation status of the CpG islands, and principal component analysis is then performed to decorrelate the selected features. Data from the Human Epigenome Project (HEP) are used to train, validate and test the predictive models. Specifically, the models are trained and validated using the DNA methylation data obtained in CD4 lymphocytes, and are then tested for generalizability using the DNA methylation data obtained in the other 11 normal tissues and cell types.
Our experiments have shown that (1) an eight-dimensional feature space that is selected via principal component analysis and that combines all categories of information is effective for predicting the CpG island methylation status, (2) by incorporating information on nucleosome positioning, gene functions, and histone acetylation, the models achieve higher specificity and accuracy than the existing models while maintaining comparable sensitivity, (3) the histone modification (methylation and acetylation) information contributes significantly to the prediction, and without it the performance of the models deteriorates, and (4) the predictive models generalize well to different tissues and cell types. The developed program CpGIMethPred is freely available at http://users.ece.gatech.edu/~hzheng7/CGIMetPred.zip.

Highlights

  • Epigenetics refers to structural adaptation of chromosomal regions to register, signal or perpetuate altered activity states [1]

  • There are 101 methylated and 368 unmethylated CpG islands for the CD4 lymphocytes, which are used for training and validating the predictive models, while the CpG islands in the other tissues or cell types are used for generalizability testing

  • We perform a two-step feature selection procedure: a statistical test first selects the features that are highly correlated with the methylation status of CpG islands, and principal component analysis (PCA) then minimizes the redundancy among the selected features
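
The two-step feature selection described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the CpGIMethPred implementation: the source does not name the statistical test, so a Welch t-statistic with an arbitrary cutoff stands in for it, and the feature matrix is synthetic. Only the class sizes (101 methylated, 368 unmethylated) and the eight retained components come from the text; `welch_t`, `select_and_decorrelate` and `t_threshold` are hypothetical names.

```python
import numpy as np

def welch_t(x_pos, x_neg):
    """Per-feature Welch t-statistic between two groups (rows = samples)."""
    mean_diff = x_pos.mean(axis=0) - x_neg.mean(axis=0)
    se = np.sqrt(x_pos.var(axis=0, ddof=1) / len(x_pos)
                 + x_neg.var(axis=0, ddof=1) / len(x_neg))
    return mean_diff / se

def select_and_decorrelate(X, y, t_threshold=3.0, n_components=8):
    # Step 1 (assumed test): keep features whose |t| exceeds the cutoff.
    keep = np.abs(welch_t(X[y == 1], X[y == 0])) > t_threshold
    Xs = X[:, keep]
    # Step 2: PCA via SVD of the standardised, centred features.
    # Assumes no selected feature is constant (non-zero standard deviation).
    Xc = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0, ddof=1)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(n_components, Vt.shape[0])
    Z = Xc @ Vt[:k].T  # projected features are mutually uncorrelated
    return Z, keep

# Synthetic stand-in data: 20 hypothetical features, the first 12 informative.
rng = np.random.default_rng(0)
y = np.array([1] * 101 + [0] * 368)  # 1 = methylated, 0 = unmethylated
X = rng.normal(size=(469, 20))
X[:, :12] += 1.0 * y[:, None]

Z, keep = select_and_decorrelate(X, y)
print(Z.shape, int(keep.sum()))
```

The resulting matrix `Z` would then be the input to a classifier such as a support vector machine. Filtering before PCA keeps uninformative features from contributing variance to the leading components.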

Summary

Background

Epigenetics refers to structural adaptation of chromosomal regions to register, signal or perpetuate altered activity states [1]. In addition to DNA composition features, Fang et al. used the distribution of the repetitive element AluY as well as the distribution of TFBSs for predicting the methylation status of CpG-rich segments, and reported ~84% specificity and ~84% sensitivity on the human brain data set using a support vector machine-based classifier [3]. In light of the reported interaction between histone modification enzymes and DNA methylases [16,17], Fan et al. found four histone methylation marks that are highly correlated with the DNA methylation status of CpG islands, and incorporated these marks into the prediction of the methylation status of CpG islands. Compared to methods without histone methylation information [13,11], the augmented features led to improved performance: ~94% specificity and ~74% sensitivity on the CD4 T cell data set using a support vector machine-based classifier [13]. There are 101 methylated and 368 unmethylated CpG islands for the CD4 lymphocytes, which are used for training and validating the predictive models, while the CpG islands in the other tissues or cell types are used for generalizability testing.
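
Because unmethylated islands outnumber methylated ones by 368 to 101, accuracy alone can mislead, which is presumably why sensitivity and specificity are reported separately. The short sketch below works through the arithmetic. It assumes "methylated" is the positive class (the text does not state this), and the function names are hypothetical. The last call combines the rounded ~74% sensitivity and ~94% specificity quoted above with these class counts purely for illustration; it is not a reported result.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)   # fraction of methylated islands recovered
    specificity = tn / (tn + fp)   # fraction of unmethylated islands recovered
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

def accuracy_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Accuracy is the class-size-weighted average of the two rates."""
    return (sensitivity * n_pos + specificity * n_neg) / (n_pos + n_neg)

N_METHYLATED, N_UNMETHYLATED = 101, 368  # CD4 lymphocyte counts from the text

# Trivial baseline: call every island unmethylated.
baseline = confusion_metrics(tp=0, fn=N_METHYLATED, tn=N_UNMETHYLATED, fp=0)
print([round(v, 3) for v in baseline])  # [0.0, 1.0, 0.785]

# Illustrative only: rounded rates from the text applied to these counts.
illustrative = accuracy_from_rates(0.74, 0.94, N_METHYLATED, N_UNMETHYLATED)
print(round(illustrative, 3))
```

The baseline shows that a classifier with zero sensitivity already reaches about 78% accuracy on this class balance, so gains in accuracy are only meaningful alongside the sensitivity figure.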
