Analyzing ChIP-chip Data Using Bioconductor

Joern Toedling, Wolfgang Huber, Fran Lewitter

doi:10.1371/journal.pcbi.1000227

Abstract

ChIP-chip, chromatin immunoprecipitation combined with DNA microarrays, is a widely used assay for DNA–protein binding and chromatin plasticity, which are of fundamental interest for the understanding of gene regulation. The interpretation of ChIP-chip data poses two computational challenges: first, what can be termed primary statistical analysis, which includes quality assessment, data normalization and transformation, and the calling of regions of interest; second, integrative bioinformatic analysis, which interprets the data in the context of existing genome annotation and of related experimental results obtained, for example, from other ChIP-chip or (m)RNA abundance microarray experiments. Both tasks rely heavily on visualization, which helps to explore the data as well as to present the analysis results. For the primary statistical analysis, some standardization is possible and desirable: commonly used experimental designs and microarray platforms allow the development of relatively standard workflows and statistical procedures. Most software available for ChIP-chip data analysis can be employed in such standardized approaches [1]–[6]. Yet even for primary analysis steps, it may be beneficial to adapt them to specific experiments, and hence it is desirable that software offers flexibility in the choice of algorithms for normalization, visualization, and identification of enriched regions. For the second task, integrative bioinformatic analysis, the datasets, questions, and applicable methods are diverse, and a degree of flexibility is needed that often can only be achieved in a programmable environment. In such an environment, users are not limited to predefined functions, such as the ones made available as “buttons” in a GUI, but can supply custom functions that are designed toward the analysis at hand. 
Bioconductor [7] is an open source and open development software project for the analysis and comprehension of genomic data. It offers tools that cover a broad range of computational methods, visualizations, and experimental data types, and it is designed to allow the construction of scalable, reproducible, and interoperable workflows. A consequence of the wide range of functionality of Bioconductor and its concurrency with research progress in biology and computational statistics is that using its tools can be daunting for a new user. Various books provide a good general introduction to R and Bioconductor (e.g., [8]–[10]), and most Bioconductor packages are accompanied by extensive documentation. This tutorial covers basic ChIP-chip data analysis with Bioconductor. Among the packages used are Ringo [5], biomaRt [11], and topGO [12].
We wrote this document in the Sweave [13] format, which combines explanatory text and the actual R source code used in this analysis [14]. Thus, the analysis can be reproduced by the reader. An R package, ccTutorial, that contains the data, the text, and the code presented here, as well as supplementary text and code, is available from the Bioconductor Web site.
> library("Ringo")
> library("biomaRt")
> library("topGO")
> library("ccTutorial")
Terminology. Reporters are the DNA sequences fixed to the microarray; they are designed to specifically hybridize with corresponding genomic fragments from the immunoprecipitate. A reporter has a unique identifier and a unique sequence, and it can appear in one or multiple features on the array surface [15]. The sample is the aliquot of immunoprecipitated or input DNA that is hybridized to the microarray. We shall call a genomic region apparently enriched by ChIP a ChIP-enriched region.
The data. We consider a ChIP-chip dataset on a post-translational modification of histone protein H3, namely tri-methylation of its Lysine residue 4, in short H3K4me3.
H3K4me3 has been associated with active transcription (e.g., [16],[17]). Here, enrichment for H3K4me3 was investigated in Mus musculus brain and heart cells. The microarray platform is a set of four arrays manufactured by NimbleGen containing 390 k reporters each. The reporters were designed to tile 32,482 selected regions of the Mus musculus genome (assembly mm5) with one base every 100 bp, with a different set of promoters represented on each of the four arrays ([18], Methods: Condensed array ChIP-chip). We obtained the data from the GEO repository [19] (accession GSE7688).
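To get a feeling for the scale of the array design described above, the quoted figures can be combined in a short back-of-the-envelope calculation. This is only a rough sketch: the variable names are ours, "390 k" is a rounded figure, and we assume for illustration that every reporter is a tiling reporter spaced 100 bp from the next (control or replicate reporters are ignored).

```r
# Figures quoted in the text; all variable names are illustrative.
n_arrays            <- 4       # arrays in the NimbleGen set
reporters_per_array <- 390e3   # "390 k" reporters, a rounded figure
n_regions           <- 32482   # selected genomic regions that are tiled
spacing_bp          <- 100     # assumed distance between neighboring reporters

# Assumption: all reporters are tiling reporters (no controls, no replicates).
total_reporters      <- n_arrays * reporters_per_array
reporters_per_region <- total_reporters / n_regions
region_span_bp       <- reporters_per_region * spacing_bp

cat(sprintf("Total reporters:          %.0f\n", total_reporters))
cat(sprintf("Reporters per region:     %.1f\n", reporters_per_region))
cat(sprintf("Approximate region span:  %.1f kb\n", region_span_bp / 1000))
```

Under these assumptions, each selected region is covered by roughly 48 reporters, which corresponds to a tiled span of a little under 5 kb per region. The true numbers will differ somewhat, since the per-array reporter count is approximate.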
