Parametric bootstrapping for biological sequence motifs.

Patrick K O’Neill, Ivan Erill

doi:10.1186/s12859-016-1246-8

Abstract

Background: Biological sequence motifs drive the specific interactions of proteins and nucleic acids. Accordingly, the effective computational discovery and analysis of such motifs is a central theme in bioinformatics. Many practical questions about the properties of motifs can be recast as random sampling problems. In this light, the task is to determine for a given motif whether a certain feature of interest is statistically unusual among relevantly similar alternatives. Despite the generality of this framework, its use has been frustrated by the difficulties of defining an appropriate reference class of motifs for comparison and of sampling from it effectively.

Results: We define two distributions over the space of all motifs of given dimension. The first is the maximum entropy distribution subject to mean information content, and the second is the truncated uniform distribution over all motifs having information content within a given interval. We derive exact sampling algorithms for each. As a proof of concept, we employ these sampling methods to analyze a broad collection of prokaryotic and eukaryotic transcription factor binding site motifs. In addition to positional information content, we consider the informational Gini coefficient (IGC) of the motif, a measure of the degree to which information is evenly distributed throughout a motif’s positions. We find that both prokaryotic and eukaryotic motifs tend to exhibit higher IGC than would be expected by chance under either reference distribution. As a second application, we apply maximum entropy sampling to the motif p-value problem and use it to give elementary derivations of two new estimators.

Conclusions: Despite the historical centrality of biological sequence motif analysis, this study constitutes, to our knowledge, the first use of principled null hypotheses for sequence motifs given information content. Through their use, we are able to characterize for the first time differences in global motif statistics between biological motifs and their null distributions. In particular, we observe that biological sequence motifs show an unusual distribution of IGC, presumably due to biochemical constraints on the mechanisms of direct read-out.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1246-8) contains supplementary material, which is available to authorized users.
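The two motif statistics named in the abstract, information content (IC) and the informational Gini coefficient (IGC), can be illustrated with a minimal Python sketch. The abstract does not give formulas, so the definitions here are assumptions: per-column IC is taken as 2 bits minus the column's Shannon entropy (DNA alphabet, uniform background, no small-sample correction), and IGC is taken as the standard Gini coefficient of the per-column IC values. The function names are hypothetical and the paper's exact definitions may differ.

```python
import math
from collections import Counter


def column_ic(column, alphabet="ACGT"):
    """IC (bits) of one motif column: log2(|alphabet|) minus Shannon entropy.

    Assumes a uniform background and applies no small-sample correction.
    """
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(len(alphabet)) - entropy


def motif_ic_per_column(sites):
    """Per-column IC for a list of aligned, equal-length binding sites."""
    return [column_ic(col) for col in zip(*sites)]


def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly even)."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    xs = sorted(values)
    # Sorted-values formula: G = sum_i (2i - n - 1) * x_i / (n * sum(x))
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Toy motif: column 1 is fully conserved, column 2 is uniform.
sites = ["AC", "AG", "AT", "AA"]
ics = motif_ic_per_column(sites)  # [2.0, 0.0]
igc = gini(ics)                   # 0.5
```

In this toy motif all of the information sits in one of two columns, so the Gini coefficient takes its maximum value for two positions (0.5); a motif whose columns carry equal information would score 0.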

Highlights

Biological sequence motifs drive the specific interactions of proteins and nucleic acids
Our results show that the degree of disparity in information content across positions, as measured by informational Gini coefficient (IGC), is significantly higher in transcription factor binding motifs than in null ensembles with matched IC, and that higher IGC is not consistently associated with motifs bound by multimeric TFs
The length of the motif was fixed at 10 bp; the number of sites in the motif, N, varied from 20 to 200; and the desired IC varied between 5 and 15 bits, with a tolerance of ±0.1 bits for the truncated uniform (TU) distribution

Summary

Introduction

Biological sequence motifs drive the specific interactions of proteins and nucleic acids. Many biological processes rely on the recognition of specific sequence patterns, or motifs, that define specific interactions between biological molecules [2]. The ubiquity of these motifs has led to the proliferation of a vast array of bioinformatics algorithms dedicated to the discovery and study of these sequence elements and their evolution [2, 3, 4, 5, 6, 7, 8]. A prominent example is the binding of transcription factors (TFs) to DNA. This binding relies on the specific recognition of short (5–30 bp) DNA sequence motifs by the TF, and the discovery and characterization of TF-binding motifs is therefore essential to our understanding of transcriptional gene regulation [5, 9, 10].

Journal: BMC Bioinformatics	Publication Date: Oct 6, 2016
Citations: 2	License type: CC BY 4.0

Similar Papers

Meta-analysis discovery of tissue-specific DNA sequence motifs from mammalian gene expression data
Bertrand R Huber ... Martha L Bulyk
BMC Bioinformatics | VOL. 7
27 Apr 2006

MCOIN: a novel heuristic for determining TFBS motif width
18 Jun 2013

Efficient representation and P-value computation for high-order Markov motifs
Paulo G S Da Fonseca ... Katia S Guimarães
Bioinformatics | VOL. 24
09 Aug 2008

Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm.
I Rigoutsos ... A Floratos
Bioinformatics | VOL. 14
01 Jan 1998
