Abstract
BackgroundMotif finding algorithms have developed in their ability to use computationally efficient methods to detect patterns in biological sequences. However the posterior classification of the output still suffers from some limitations, which makes it difficult to assess the biological significance of the motifs found. Previous work has highlighted the existence of positional bias of motifs in the DNA sequences, which might indicate not only that the pattern is important, but also provide hints of the positions where these patterns occur preferentially.ResultsWe propose to integrate position uniformity tests and over-representation tests to improve the accuracy of the classification of motifs. Using artificial data, we have compared three different statistical tests (Chi-Square, Kolmogorov-Smirnov and a Chi-Square bootstrap) to assess whether a given motif occurs uniformly in the promoter region of a gene. Using the test that performed better in this dataset, we proceeded to study the positional distribution of several well known cis-regulatory elements, in the promoter sequences of different organisms (S. cerevisiae, H. sapiens, D. melanogaster, E. coli and several Dicotyledons plants). The results show that position conservation is relevant for the transcriptional machinery.ConclusionWe conclude that many biologically relevant motifs appear heterogeneously distributed in the promoter region of genes, and therefore, that non-uniformity is a good indicator of biological relevance and can be used to complement over-representation tests commonly used. In this article we present the results obtained for the S. cerevisiae data sets.
Highlights
Motif finding algorithms have developed in their ability to use computationally efficient methods to detect patterns in biological sequences
In this article we propose to integrate position uniformity tests and over-representation tests based on Markov models to improve the posterior classification of the motifs and better assess their biological significance
We propose that the integration of position uniformity tests and over-representation tests can be used to improve the accuracy of the classification of motifs found by combinatorial motif finders
Summary
Motif finding algorithms have developed in their ability to use computationally efficient methods to detect patterns in biological sequences. The computational analysis of DNA sequences represents a major endeavor in the post-genomic era. The increasing number of whole-genome sequencing projects has provided an enormous amount of information which leads to the need of new tools and string processing algorithms to analyze and classify the obtained sequences [1]. In this regard, the study of short functional DNA segments, such as transcriptional factor binding sites, has emerged as an important effort to understand key control mechanisms. DNA motifs can be represented in a number of different ways. Our approach is independent of the way motifs are modeled, since it requires only the list of occurrences of motifs, something that can be obtained from any motif representation
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.