Abstract

Background: Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, has been the standard model for this task for more than three decades, but its simple assumptions are increasingly being questioned. Recent high-throughput sequencing methods have provided data sets of sufficient size and quality for studying the benefits of more complex models. However, learning more complex models typically entails the danger of overfitting, and while model classes that dynamically adapt the model complexity to the data have been developed, effective model selection is to date only possible for fully observable data, but not, e.g., within de novo motif discovery.

Results: To address this issue, we propose a stochastic algorithm for performing robust model selection in a latent variable setting. This algorithm yields a solution without relying on hyperparameter tuning via massive cross-validation or other computationally expensive resampling techniques. Using this algorithm for learning inhomogeneous parsimonious Markov models, we study the degree of putative higher-order intra-motif dependencies for transcription factor binding sites inferred via de novo motif discovery from ChIP-seq data. We find that intra-motif dependencies are prevalent and not limited to first-order dependencies among directly adjacent nucleotides, but that second-order models appear to be the significantly better choice.

Conclusions: The traditional PWM model indeed appears to be insufficient for inferring realistic sequence motifs, as it is on average outperformed by more complex models that take intra-motif dependencies into account. Moreover, using such models together with an appropriate model selection procedure does not lead to a significant performance loss in comparison with the PWM model for any of the studied transcription factors. Hence, we recommend that any modern motif discovery algorithm should attempt to take intra-motif dependencies into account.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0797-4) contains supplementary material, which is available to authorized users.

Highlights

  • Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics

  • We may observe a ChIP-seq data set that does not contain one clearly overrepresented sequence motif, and including such a data set into a systematic evaluation of intra-motif dependencies could and often would yield misleading results

  • We have investigated the prevalence of intra-motif dependencies in transcription factor binding sites as well as the task of utilizing them for improving de novo motif discovery from ChIP-seq data

Summary

Introduction

Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model assumes statistical independence among all nucleotides in the motif and corresponds to the biophysical assumption that the binding affinities of nucleotides within a DNA binding site to the corresponding DNA-binding protein are additive [21]. Due to this independence assumption, the PWM model requires comparatively few parameters, which can be robustly estimated even from few and noisy training sequences, but there is an ongoing discussion about its capability of accurately modeling protein-DNA interaction [21,22,23,24,25,26,27]. With the rise of high-throughput techniques such as ChIP-seq [28], the size and quality of available training data sets increase, which in turn makes the use of more complex models promising.
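The independence assumption and the resulting parameter economy can be illustrated with a minimal Python sketch. It is not taken from the article: the toy matrix `PWM`, the function names, and the parameter-counting rule for fixed-order inhomogeneous Markov models (each position conditions on at most `order` preceding nucleotides within the motif) are assumptions made only for illustration, and the parsimonious Markov models studied in the article are not implemented here.

```python
import math

# Hypothetical toy PWM of width 3 (illustrative values, not from the article):
# one independent nucleotide distribution per motif position.
PWM = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]


def pwm_log_likelihood(site, pwm):
    """Log-probability of a binding site under a PWM.

    Because positions are assumed independent, the site probability is a
    product of per-position probabilities, i.e. a sum of their logarithms.
    """
    if len(site) != len(pwm):
        raise ValueError("site length must equal motif width")
    return sum(math.log(column[nuc]) for column, nuc in zip(pwm, site))


def count_parameters(width, order):
    """Free parameters of a fixed-order inhomogeneous Markov model.

    Assumption: position i (0-based) conditions on min(i, order) preceding
    nucleotides, and each context has 3 free probabilities (4 minus 1).
    Order 0 corresponds to the PWM model.
    """
    return sum(3 * 4 ** min(i, order) for i in range(width))


if __name__ == "__main__":
    print(pwm_log_likelihood("AGT", PWM))
    for order in (0, 1, 2):
        print(order, count_parameters(10, order))
```

Under these assumptions the number of free parameters grows quickly with the model order for a fixed motif width, which is why higher-order models need larger training sets or a model selection procedure to avoid overfitting.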
