Abstract

A central problem in genome-wide experimental analysis is how to extract meaningful biological phenomena from the resulting large data sets. Here, we present modeling and prediction techniques for genome-wide identification of in vivo protein-DNA binding sites from ChIP-based data sets. We develop a simple probabilistic mixture model of the occurrence of non-specific and specific binding events for a transcription factor (TF) binding to any site in the genome. We calculate the statistical significance of specific and non-specific random binding events using Kolmogorov-Waring and exponential functions, respectively. Binding events in chromosome regions associated with non-specific, non-random binding loci are also identified and filtered out. The mixture model fits equally well the data for five different TFs (ERE, CREB, STAT1, Nanog, Oct4) produced by the ChIP-PET, SACO, and ChIP-Seq methods included in this study. We present a uniform methodology for estimating the specificity, the total number of binding sites, and the sensitivity of data sets produced by these ChIP-based genome-wide experimental systems. We demonstrate strong heterogeneity of specific TF-DNA binding sites in terms of their avidity, and a correlation between the observed relative binding avidity of a specific TF-DNA binding site and the level of mRNA transcription of the nearest target gene. Finally, we conclude that the sensitivity problem has not been resolved by current ChIP-based methods, including ChIP-Seq.
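To make the mixture idea concrete, the following is a minimal sketch of a two-component model of cluster sizes, not the method used in the paper. It assumes a discretised exponential for the non-specific (noise) component and, as a plainly labeled stand-in for the Kolmogorov-Waring function, a truncated power law for the specific component. All function names, parameter values, and the working definition of specificity (the fraction of clusters at or above a size threshold that the model attributes to the specific component) are illustrative assumptions.

```python
import math


def exponential_pmf(k, lam):
    """Discretised exponential (geometric) probability of cluster size k = 1, 2, ...

    Assumed noise model for non-specific random binding events.
    """
    return (1.0 - math.exp(-lam)) * math.exp(-lam * (k - 1))


def power_law_pmf(k, alpha, k_max):
    """Truncated power law on 1..k_max.

    Stand-in for the skewed specific-binding component; this is NOT the
    Kolmogorov-Waring function, only a simple heavy-tailed substitute.
    """
    norm = sum(j ** -alpha for j in range(1, k_max + 1))
    return k ** -alpha / norm


def specificity(threshold, w_noise, lam, alpha, k_max):
    """Fraction of clusters with size >= threshold attributed to specific binding.

    w_noise is the assumed mixture weight of the noise component.
    """
    sizes = range(threshold, k_max + 1)
    noise = w_noise * sum(exponential_pmf(k, lam) for k in sizes)
    specific = (1.0 - w_noise) * sum(power_law_pmf(k, alpha, k_max) for k in sizes)
    return specific / (noise + specific)


if __name__ == "__main__":
    # Hypothetical parameters: 90% of clusters are noise.
    for t in (1, 2, 3, 5, 8):
        s = specificity(t, w_noise=0.9, lam=1.5, alpha=2.0, k_max=200)
        print(f"threshold {t}: estimated specificity {s:.3f}")
```

Because the exponential tail decays much faster than the heavy-tailed component, raising the cluster-size threshold increases the estimated specificity while discarding weaker specific sites, which is the specificity versus sensitivity trade-off the abstract refers to.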
