Vocabulon: a dictionary model approach for reconstruction and localization of transcription factor binding sites

Chiara Sabatti, Lars Rohlin, James C Liao, Kenneth Lange

doi:10.1093/bioinformatics/bti083

Abstract

Gene expression arrays measure transcription levels for a large number of genes, or for all genes, in a genome. To interpret these results and use them to reconstruct transcription networks, one needs the locations of binding sites for regulatory proteins across the entire genome. This remains an open problem in Escherichia coli in particular. We describe the first application of dictionary-style models to the study of transcription factor binding sites in an entire genome. Vocabulon's unique feature is that, by simultaneous search, it can both reconstruct binding sites characterized by unknown motifs and impute the locations of known binding sites in long sequences. On the one hand, the dictionary model specifies a probability for the entire sequence, taking all possible binding sites into account simultaneously; this greatly reduces the number of false positives. On the other hand, the ability to refine motif descriptions as more binding sites are identified increases the sensitivity of the method. We illustrate these properties with examples in E. coli. Gene expression array results are used both to guide the search and to corroborate it.
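To make the idea of a probability for the entire sequence concrete, the following minimal sketch shows the forward recursion commonly used with dictionary-style models: the sequence is treated as a concatenation of "words" (single-letter background words and longer motif words), and the likelihood sums over every possible segmentation. This is an illustration under stated assumptions, not the Vocabulon implementation; the function name `sequence_likelihood`, the word usage probabilities, and the toy two-letter motif are all hypothetical.

```python
# Illustrative sketch only: names, usage probabilities, and the toy motif
# below are assumptions, not taken from the paper.

def sequence_likelihood(seq, words):
    """Likelihood of seq under a dictionary model.

    words: list of (usage_prob, pwm) pairs, where pwm is a list with one
    dict per motif position mapping a base to its emission probability.
    The result sums over all ways of segmenting seq into words.
    """
    n = len(seq)
    forward = [0.0] * (n + 1)
    forward[0] = 1.0  # the empty prefix has exactly one (empty) segmentation
    for i in range(1, n + 1):
        total = 0.0
        for usage, pwm in words:
            k = len(pwm)
            if k > i:
                continue  # this word cannot end at position i
            emit = 1.0
            for j, column in enumerate(pwm):
                emit *= column[seq[i - k + j]]
            # prefix of length i - k, followed by this word ending at i
            total += forward[i - k] * usage * emit
        forward[i] = total
    return forward[n]


# Hypothetical dictionary: a uniform one-letter background word and a
# two-letter motif that always spells "AC".
BACKGROUND = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}]
MOTIF_AC = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
]
WORDS = [(0.9, BACKGROUND), (0.1, MOTIF_AC)]

print(sequence_likelihood("AC", WORDS))
print(sequence_likelihood("GG", WORDS))
```

In this toy setting, "AC" can be read either as two background letters or as one motif occurrence, so its likelihood is higher than that of "GG", which has only the background reading. Because every candidate site competes with all alternative readings of the same stretch of sequence, a weak match is less likely to be reported in isolation, which is the intuition behind the reduction in false positives described above.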

Full Text