Abstract

Transcription factors regulate gene expression by binding regulatory DNA. Understanding the rules governing such binding is an essential step in describing the network of regulatory interactions, and its pathological alterations. We show that describing regulatory regions in terms of their profile of total binding affinities for transcription factors leads to increased predictive power compared to methods based on the identification of discrete binding sites. This applies both to the prediction of transcription factor binding as revealed by ChIP-seq experiments and to the prediction of gene expression through RNA-seq. Further significant improvements in predictive power are obtained when regulatory regions are defined based on chromatin states inferred from histone modification data.

Highlights

  • We sought to compare the predictive power of the total binding affinity (TBA) and occupancy at various cutoff values, as measured by the area under the Receiver Operating Characteristic (ROC) curve (AUC)

  • We used RNA-seq experiments performed within the ENCODE project on 9 human cell lines, and we modeled the dependence of gene expression on TBA profiles with log-linear regression: the logarithmic expression $e_g$ of gene $g$ is given by $e_g = \sum_i c_i a_{gi} + b + r_g$ (5)

  • In this work we have shown that describing regulatory sequences in terms of Total Binding Affinity profiles is superior to methods based on the identification of discrete binding sites in predicting transcription factor (TF) binding and gene expression
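
The AUC comparison mentioned in the first highlight can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `roc_auc`, the toy scores, and the binding labels are all hypothetical, and the occupancy values stand in for counts of discrete sites above some cutoff. The AUC is computed through its rank-sum (Mann-Whitney) form, with tied scores sharing their average rank.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    scores: one predictor value per regulatory region (e.g. TBA or occupancy).
    labels: 1 if the region overlaps a ChIP-seq peak, 0 otherwise.
    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one bound and one unbound region")

    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the end (exclusive) of the group of regions tied at this score.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1])
        i = j

    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


# Hypothetical toy data for six regions (illustrative only, not from the paper).
bound = [1, 1, 0, 1, 0, 0]                  # ChIP-seq peak present?
tba = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]       # continuous affinity score
occupancy = [2, 1, 1, 0, 0, 0]              # discrete sites above a cutoff

auc_tba = roc_auc(tba, bound)
auc_occupancy = roc_auc(occupancy, bound)
print(f"AUC (TBA): {auc_tba:.3f}  AUC (occupancy): {auc_occupancy:.3f}")
```

In the toy data the occupancy score loses resolution because several regions share the same site count, which is one way a cutoff-based predictor can end up with a lower AUC than a continuous one.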

Summary

Methods

ChIP-seq peaks were downloaded from the hg UCSC track “Transcription Factor ChIP-seq Uniform Peaks from ENCODE/Analysis”. All the R² values we quote are adjusted for the number of predictors in the model. In addition to these measures, which refer to the complete model, multivariate linear regression yields a coefficient and a significance value for each of the independent variables (i.e. the TBA for each PWM), reflecting the influence of its values on the dependent variable (i.e. gene expression) while taking into account the global effect of all the other variables. In this context it is advisable to use a small set of well-defined, non-redundant PWMs, both to avoid setting up an overdetermined model and to be able to interpret the coefficients; for these models we used a reduced collection of 130 PWMs from JASPAR CORE. PWM data were obtained via the JASPAR2014 [21] or the MotifDB package [11].
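
The regression described above can be sketched as an ordinary least-squares fit with an adjusted R². This is a minimal sketch under stated assumptions, not the authors' code: it reads the model as log expression being a weighted sum of per-PWM TBA values plus an intercept and a residual, the function name `fit_log_linear` is hypothetical, and the data are synthetic. Per-coefficient significance values are not computed here; a statistics package would normally supply them.

```python
import numpy as np


def fit_log_linear(tba, log_expr):
    """Least-squares fit of log expression on a TBA profile.

    tba:      (n_genes, n_pwms) array, one TBA value per gene and PWM.
    log_expr: (n_genes,) array of log expression values.
    Returns (coefficients, intercept, adjusted R^2).
    """
    tba = np.asarray(tba, dtype=float)
    log_expr = np.asarray(log_expr, dtype=float)
    n_genes, n_pwms = tba.shape

    # Append a column of ones so the last fitted parameter is the intercept.
    design = np.column_stack([tba, np.ones(n_genes)])
    params, *_ = np.linalg.lstsq(design, log_expr, rcond=None)

    residuals = log_expr - design @ params
    ss_res = float(residuals @ residuals)
    ss_tot = float(((log_expr - log_expr.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot

    # Penalize R^2 for the number of predictors (PWMs) in the model.
    adj_r2 = 1.0 - (1.0 - r2) * (n_genes - 1) / (n_genes - n_pwms - 1)
    return params[:-1], float(params[-1]), adj_r2


# Synthetic example: 50 genes, 3 PWMs (values are made up for illustration).
rng = np.random.default_rng(0)
toy_tba = rng.normal(size=(50, 3))
true_coef = np.array([1.5, -0.7, 0.0])
toy_log_expr = toy_tba @ true_coef + 2.0 + 0.1 * rng.normal(size=50)

coef, intercept, adj_r2 = fit_log_linear(toy_tba, toy_log_expr)
print("coefficients:", np.round(coef, 2))
print("intercept:", round(intercept, 2), "adjusted R^2:", round(adj_r2, 3))
```

Because the adjustment divides by the number of genes minus the number of PWMs minus one, adding redundant PWMs lowers the adjusted value unless they explain additional variance, which is consistent with the preference for a small, non-redundant PWM set.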

Results
PWMs contribute additively to $e_g$
Discussion