Abstract

Cellular identity and behavior are controlled by complex gene regulatory networks. Transcription factors (TFs) bind to specific DNA sequences to regulate the transcription of their target genes. On the basis of these TF motifs in cis-regulatory elements, we can model the influence of TFs on gene expression. Such models of TF motif activity usually assume a linear relationship between motif activity and gene expression level. A commonly used method to model motif influence is based on Ridge Regression. One important assumption of linear regression is independence between samples. However, if samples are generated from the same cell line, tissue, or other biological source, this assumption may be invalid. The same assumption of independence is also applied to different yet similar experimental conditions, where it may also be inappropriate. In theory, wrongly assuming independence between samples could lead to a loss in signal detection. Here we investigate whether a Bayesian model that allows for correlations results in more accurate inference of motif activities. We extend Ridge Regression to a Bayesian Linear Mixed Model, which allows us to model dependence between different samples. In a simulation study, we investigate the differences between the two model assumptions. We show that our Bayesian Linear Mixed Model implementation outperforms Ridge Regression in a simulation scenario where the noise, i.e. the signal that cannot be explained by TF motifs, is uncorrelated. However, we demonstrate that there is no such gain in performance if the noise has a similar covariance structure over samples as the signal that can be explained by motifs. We give a mathematical explanation of why this is the case. Using four representative real datasets, we show that at most ~40% of the signal is explained by motifs using the linear model. With these data there is no advantage to using the Bayesian Linear Mixed Model, due to the similarity of the covariance structure. The project implementation is available at https://github.com/Sim19/SimGEXPwMotifs.
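
The Ridge Regression baseline described in the abstract can be illustrated with a minimal sketch. It assumes a gene-by-motif score matrix `M` and a gene-by-sample expression matrix `Y`, and estimates motif activities independently for each sample using the closed-form ridge solution. The function name, the penalty value, and the simulated data are illustrative assumptions and are not taken from the project repository.

```python
import numpy as np


def ridge_motif_activity(motif_scores, expression, lam=1.0):
    """Closed-form ridge estimate of motif activities (illustrative helper).

    motif_scores: array of shape (genes, motifs), the matrix M
    expression:   array of shape (genes, samples), the matrix Y
    Returns an array of shape (motifs, samples):
        W_hat = (M^T M + lam * I)^(-1) M^T Y
    Each sample (column of Y) is fitted independently of the others.
    """
    n_motifs = motif_scores.shape[1]
    gram = motif_scores.T @ motif_scores + lam * np.eye(n_motifs)
    return np.linalg.solve(gram, motif_scores.T @ expression)


# Simulated toy data (assumption: Gaussian scores, activities, and noise).
rng = np.random.default_rng(0)
n_genes, n_motifs, n_samples = 500, 10, 4
M = rng.normal(size=(n_genes, n_motifs))
W_true = rng.normal(size=(n_motifs, n_samples))
Y = M @ W_true + 0.1 * rng.normal(size=(n_genes, n_samples))

W_hat = ridge_motif_activity(M, Y, lam=1.0)
max_err = float(np.max(np.abs(W_hat - W_true)))
print("estimated activities shape:", W_hat.shape)
print("largest absolute error:", max_err)
```

Because every column of `Y` is solved separately, this estimator ignores any correlation between samples. That is the independence assumption the Bayesian Linear Mixed Model is meant to relax.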

Highlights

  • Cell type-specific gene expression programs are mainly driven by differential expression and binding of transcription factors (TFs)

  • We extend Ridge Regression to a Bayesian Linear Mixed Model, which allows us to model dependence between different samples

  • We show that our Bayesian Linear Mixed Model implementation outperforms Ridge Regression in a simulation scenario where the noise, i.e. the signal that cannot be explained by TF motifs, is uncorrelated

Introduction

Cell type-specific gene expression programs are mainly driven by differential expression and binding of transcription factors (TFs). The human genome contains ~1,600 TFs, which represent 8% of all genes [1]. These proteins bind DNA in a sequence-specific manner and typically have a 1000-fold or greater preference for their cognate binding site as compared to other sequences [2]. By binding to cis-regulatory regions, i.e. promoters and enhancers, they can control the chromatin environment and the expression of downstream target genes [1]. It is of great importance to understand the mechanisms of gene regulation driven by TFs
