Computational analysis of promoters and DNA-protein interactions

Andrija Tomović

doi:10.5451/unibas-005042288

Abstract

The investigation of promoter activity and DNA-protein interactions is very important for understanding many crucial cellular processes, including transcription, recombination and replication. Promoter activity and DNA-protein interactions can be studied in the lab (in vitro or in vivo) or using computational methods (in silico). Computational approaches for analysing promoters and DNA-protein interactions have become more powerful as more complete genome sequences, 3D structural data, and high-throughput data (such as ChIP-chip and expression data) have become available. Modern scientific research into promoters and DNA-protein interactions involves close cooperation between computational and laboratory methods. This thesis covers several aspects of the computational analysis of promoters and DNA-protein interactions: analysis of transcription factor binding sites (investigating position dependencies in transcription factor binding sites); computational prediction of transcription factor binding sites (a new scanning method for the in silico prediction of transcription factor binding sites is described); computational analysis of crystal structures of DNA-protein interactions (multiple proteins bound to DNA); and computational prediction of transcription factor co-operativity (dependencies between transcription factors in the human, mouse and rat genomes are investigated, and a new method for the in silico prediction of cis-regulatory motifs and transcription start sites is described). In addition, this thesis reports how one statistical method for the analysis of transcription factor binding sites can be used for estimating the quality of multiple sequence alignments. The main finding reported in this thesis is that it is wrong to assume, a priori, that positions in transcription factor binding sites are all either independent or dependent on one another. Position dependencies should be tested using rigorous statistical methods on a case-by-case basis.
When dependencies are detected, they can be modelled in a very simple way that does not require complex mathematical tools with many parameters and additional data. An example of such a model, including a web-based implementation of the algorithm, is reported in this thesis. It has also been shown that the conformational energy (indirect readout) of DNA in complexes with transcription factors that have dependent positions in their binding sites is significantly higher than in complexes with transcription factors that do not have dependent positions in their binding sites. The structural analysis of multiple protein-DNA interactions showed that the formation of interactions between multiple proteins and DNA results in a decrease in protein-protein affinity and an increase in protein-DNA affinity, with a net gain in the overall stability of complexes where multiple proteins are bound to DNA. This effect is clearly important for modelling transcription factor co-operativity. In addition, the physical overlap of two factors does not simply relate to the region on the DNA where the binding site is found. Two factors may lie very close together without physically overlapping, because their side-chains can interlink with one another. In this way, two transcription factor binding sites can overlap substantially in sequence while, from a 3D perspective, both factors can still bind simultaneously. It may also be that one transcription factor binds to the minor groove and another to the major groove of DNA. That information is also useful for modelling transcription factor co-operativity. Moreover, this thesis reports the results from a computational prediction of dependencies (co-operativities) between transcription factors which usually act together in gene regulation in the human, mouse and rat genomes.
It is shown that the computational analysis of transcription factor site dependencies is a valuable complement to experimental approaches for discovering transcription regulatory interactions and networks. Scanning promoter sequences with dependent groups of transcription factor binding sites improves the quality of transcription factor predictions. It has also been demonstrated that modelling transcription factor co-operativities improves the quality of transcription start site predictions. For three genes (ctmp, gap-43 and ngfrap), in vivo validation of the predicted transcription start sites was performed. Finally, the Bayesian method for the detection of dependencies between positions in transcription factor binding sites can easily be converted into a method for estimating the quality of multiple sequence alignments. That method is simple, has linear complexity, is easy to implement, and performs better than other state-of-the-art methods that are more complex.
