Abstract

Background: In transcription factor binding site discovery, the true width of the motif to be discovered is generally not known a priori. The ability to compute the most likely width of a motif is therefore a highly desirable property for motif discovery algorithms. However, this is a challenging computational problem as a result of changing model dimensionality at changing motif widths. The complexity of the problem is increased as the discovered model at the true motif width need not be the most statistically significant in a set of candidate motif models. Further, the core motif discovery algorithm used cannot guarantee to return the best possible result at each candidate width.

Results: We present MCOIN, a novel heuristic for automatically determining transcription factor binding site motif width, based on motif containment and information content. Using realistic synthetic data and previously characterised prokaryotic data, we show that MCOIN outperforms the current most popular method (E-value of the resulting multiple alignment) as a predictor of motif width, based on mean absolute error. MCOIN is also shown to choose models which better match known sites at higher levels of motif conservation, based on ROC analysis.

Conclusions: We demonstrate the performance of MCOIN as part of a deterministic motif discovery algorithm and conclude that MCOIN outperforms current methods for determining motif width.
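The abstract relies on two standard quantities without defining them: the information content of a motif and the mean absolute error of predicted widths. The sketch below illustrates both under stated assumptions; it is not the MCOIN heuristic, whose details are not given in this section. It assumes a DNA alphabet with a uniform background, so a column scores between 0 and 2 bits, and all function names are hypothetical.

```python
import math

def column_information_content(counts):
    """Information content in bits of one alignment column.

    Assumes a uniform background over A, C, G, T, so the maximum is
    log2(4) = 2 bits (fully conserved) and the minimum is 0 bits.
    """
    total = sum(counts.get(base, 0) for base in "ACGT")
    ic = 2.0
    for base in "ACGT":
        p = counts.get(base, 0) / total
        if p > 0:
            ic += p * math.log2(p)
    return ic

def motif_information_content(sites):
    """Per-column information content for equal-length aligned sites."""
    width = len(sites[0])
    per_column = []
    for i in range(width):
        counts = {}
        for site in sites:
            counts[site[i]] = counts.get(site[i], 0) + 1
        per_column.append(column_information_content(counts))
    return per_column

def mean_absolute_error(predicted_widths, true_widths):
    """Average absolute difference between predicted and true widths."""
    pairs = zip(predicted_widths, true_widths)
    return sum(abs(p - t) for p, t in pairs) / len(true_widths)

# Illustrative (made-up) sites: three conserved columns, one uniform column.
example_sites = ["ACGT", "ACGA", "ACGC", "ACGG"]
example_ic = motif_information_content(example_sites)
```

Summing the per-column values gives a total for the motif. A width predictor can then be scored by passing its predicted widths and the known widths to `mean_absolute_error`.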

Highlights

  • Recent advances in biology have led to a huge increase in the amount of data available for study

  • Based on tests with two data collections of previously characterised prokaryotic transcription factor binding site (TFBS) motifs, we show that motif containment and information content (MCOIN) outperforms the E-value of the resulting multiple alignment as a predictor of motif width, using mean absolute error and root mean squared error

Summary

Introduction

Recent advances in biology have led to a huge increase in the amount of data available for study. In transcription factor binding site (TFBS) discovery, the true width of the motif to be discovered is generally not known a priori, so the ability to compute the most likely width of a motif is a highly desirable property for motif discovery algorithms. This is a challenging computational problem because the dimensionality of the model changes with the motif width. The problem is further complicated because the model discovered at the true motif width need not be the most statistically significant in a set of candidate motif models, and the core motif discovery algorithm used cannot be guaranteed to return the best possible result at each candidate width.
