Abstract
A transcription factor is a molecule that usually binds to a set of promoter sequences of coexpressed genes. As a result, these promoter sequences contain short substrings, or binding sites, with similar patterns. The motif discovery problem is to find these similar patterns, or motifs, in a set of sequences. Most existing algorithms find motifs based on strong-signal sequences only (i.e., those containing binding sites very similar to the motif). In this paper, we represent a motif by a probability matrix and use it to calculate the minimum total number of binding sites that the input dataset must contain in order to confirm that the discovered motifs are not artifacts. Next, we introduce a more general and realistic energy-based model, which considers all sequences with varying degrees of binding strength to the transcription factors (as measured experimentally). By treating sequences with varying degrees of binding strength, we develop a heuristic algorithm called EBMF (Energy-Based Motif Finding Algorithm) to find the motif; it can handle sequences ranging from those that contain more than one binding site to those that contain none. EBMF can find motifs in datasets that do not even have the required minimum number of binding sites derived earlier. EBMF compares favorably with the common motif-finding programs AlignACE and MEME. In particular, for some simulated and real datasets, EBMF finds the motif when both AlignACE and MEME fail to do so.
Highlights
One great challenge in molecular biology is to understand the regulation of gene expression, the process by which a segment of DNA is decoded to form a protein.
According to the results of Buhler and Tompa [Buhler 2002], the number of these sequences falls short of the minimum number of input sequences required, which is 4, so it should be theoretically impossible to find the motif for this input set (we set n = 787, t = 3, l = 13 and d = 2). We tested this input set on two common motif-finding programs, AlignACE [Hughes 2000, Roth 1998] and MEME [Bailey 1994], which are based on the strong-signal model.
Let m be the total number of sequences, n be the length of each sequence, t be the number of sequences with binding sites, and B∗ be the number of binding sites in the t sequences. We generated the simulated data as follows
Summary
One great challenge in molecular biology is to understand the regulation of gene expression, the process by which a segment of DNA is decoded to form a protein. An mRNA molecule is formed by copying a gene from the DNA. The mRNA is then decoded to produce a protein. To start the transcription process for a particular gene, one or more corresponding proteins, called transcription factors, have to bind to several specific regions, called binding sites, in the promoter region of the gene. A transcription factor can bind to multiple binding sites, but these sites typically have similar length (usually about 8 to 20 bp) and a common DNA sequence pattern. The common patterns of these binding sites, referred to as motifs, are still unknown. Many laboratory-based methods for motif identification have been developed, but these experimental methods are both expensive and time-consuming.
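To make the idea of a shared binding-site pattern concrete, here is a minimal sketch of how a motif might be summarized as a probability matrix, the representation mentioned in the abstract. It is not the paper's EBMF algorithm or its energy model. The example sites, the function names (`probability_matrix`, `site_probability`, `consensus`), and the pseudocount of 1 are all assumptions made for illustration.

```python
from collections import Counter

BASES = "ACGT"


def probability_matrix(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from equal-length sites.

    The pseudocount (an assumption of this sketch) keeps unseen bases
    from receiving probability zero.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for column in zip(*sites):  # iterate over aligned positions
        counts = Counter(column)
        matrix.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return matrix


def site_probability(matrix, site):
    """Probability of a candidate site under the matrix (positions independent)."""
    p = 1.0
    for column, base in zip(matrix, site):
        p *= column[base]
    return p


def consensus(matrix):
    """Most probable base at each position."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in matrix)


# Hypothetical aligned binding sites, invented for this example.
sites = ["TGACTCA", "TGACTCA", "TGAGTCA", "TGACTAA"]
pm = probability_matrix(sites)
print(consensus(pm))  # TGACTCA
```

A candidate substring that resembles the sites scores higher under `site_probability` than an unrelated one, which is the basic signal a strong-signal motif finder relies on.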