Abstract

BackgroundApplying supervised learning/classification techniques to epigenomic data may reveal properties that differentiate histone modifications. Previous analyses sought to classify nucleosomes containing histone H2A/H4 arginine 3 symmetric dimethylation (H2A/H4R3me2s) or H2A.Z using human CD4+ T-cell chromatin immunoprecipitation sequencing (ChIP-Seq) data. However, these efforts only achieved modest accuracy with limited biological interpretation. Here, we investigate the impact of using appropriate data pre-processing —deduplication, normalization, and position- (peak-) finding to identify stable nucleosome positions — in conjunction with advanced classification algorithms, notably discriminatory motif feature selection and random forests. Performance assessments are based on accuracy and interpretative yield.ResultsWe achieved dramatically improved accuracy using histone modification features (99.0%; previous attempts, 68.3%) and DNA sequence features (94.1%; previous attempts, <60%). Furthermore, the algorithms elicited interpretable features that withstand permutation testing, including: the histone modifications H4K20me3 and H3K9me3, which are components of heterochromatin; and the motif TCCATT, which is part of the consensus sequence of satellite II and III DNA. Downstream analysis demonstrates that satellite II and III DNA in the human genome is occupied by stable nucleosomes containing H2A/H4R3me2s, H4K20me3, and/or H3K9me3, but not 18 other histone methylations. These results are consistent with the recent biochemical finding that H4R3me2s provides a binding site for the DNA methyltransferase (Dnmt3a) that methylates satellite II and III DNA.ConclusionsClassification algorithms applied to appropriately pre-processed ChIP-Seq data can accurately discriminate between histone modifications. Algorithms that facilitate interpretation, such as discriminatory motif feature selection, have the added potential to impart information about underlying biological mechanism.

Highlights

  • Applying supervised learning/classification techniques to epigenomic data may reveal properties that differentiate histone modifications

  • We downsampled H2A.Z nucleosomes to match the number of H2A/H4R3me2s nucleosomes for two reasons. This creates a balanced dataset for classification and yields accuracies directly comparable to those of [13]

  • To determine which features were “driving” the classification, we evaluated random forests feature importance by mean decrease in Gini index (MDG; Figure 1b; see Methods)

Read more

Summary

Introduction

Applying supervised learning/classification techniques to epigenomic data may reveal properties that differentiate histone modifications. Previous analyses sought to classify nucleosomes containing histone H2A/H4 arginine 3 symmetric dimethylation (H2A/H4R3me2s) or H2A.Z using human CD4+ T-cell chromatin immunoprecipitation sequencing (ChIP-Seq) data. These efforts only achieved modest accuracy with limited biological interpretation. Many histone post-translational modifications contribute to establishing compacted, transcriptionally repressed heterochromatin (e.g., histone H3 lysine 9 trimethylation (H3K9me3)) or open, transcriptionally poised euchromatin (e.g., H3K4me3) [1,2]. It is currently unknown why so many modifications — on at least 60 histone residues [3] — are necessary [3,4]. Such discriminating properties remain largely unknown, as redundancy and enzyme promiscuity for non-histone targets have limited the amenability of histone modifications to genetic experimentation [6]

Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call