Abstract

The human genome contains between 20,000 and 25,000 protein-coding genes as well as RNA genes with various functions, but approximately 95% of the genome is presumably not transcribed. The information governing the structural organization and transcriptional regulation of the genome is hidden in this sequence fraction. To understand and model gene regulation, one first needs to know the locations of regulatory elements. Certain proteins, the so-called transcription factors (TFs), bind to such regulatory elements, which are therefore also called transcription factor binding sites (TFBSs), and affect the transcription efficiency of the regulated gene. TFBSs can be identified experimentally by various methods, but these are time-consuming and costly. An alternative is the application of bioinformatic methods. However, regulatory elements are short and degenerate, which increases the probability that a given sequence pattern is found by chance and hampers the reliable bioinformatic detection of TFBSs. To improve the signal-to-noise ratio of this search, the sequence conservation between orthologous non-coding sequences of two or more species is frequently used. This so-called phylogenetic footprinting is based on the assumption that functional elements in non-coding regions are under higher selective pressure during evolution than non-functional regions and are therefore characterized by increased conservation in a sequence alignment.

In this work, the approach of phylogenetic footprinting has been evaluated. To calibrate and assess the approach, it has been examined to what extent experimentally verified TFBSs can be detected by sequence comparisons between human and mouse, rat, dog, and cow. For the success of phylogenetic footprinting, it is crucial to ensure the orthologous relationship between the sequences compared.
A procedure for the identification of orthologous sequences has been developed that is independent of a potentially incorrect annotation of the gene structure, because orthologous sequences are located by a search for sequence homologies in the vicinity of annotated orthologous genes. Furthermore, a conservation criterion has been determined that gives an optimal discrimination between known TFBSs and sequences that have no function or no known function. The choice of the alignment algorithm has only a marginal influence on the results, since human and rodent sequences are sufficiently similar that most alignment programs give similar results. The sequence conservation of TFBSs varies strongly depending on the corresponding TF. Furthermore, the nucleotides that contribute most to the specificity of a given DNA-TF interaction are often more highly conserved than the remaining nucleotides of a TFBS. Clear differences in the sequence conservation of TFBSs can also be seen depending on the function of the regulated gene. In general, pairwise sequence comparisons between human and mouse or rat prove to be superior to those between human and dog or cow for the identification of TFBSs.

A goal of this work has been to improve the prediction of TFBSs using the information obtained from species comparisons. A standard method for predicting TFBSs is based on the search for certain sequence patterns with so-called position-specific scoring matrices (PSSMs). Since phylogenetic footprinting provides independent evidence for the existence of a TFBS, combining a PSSM-based prediction of TFBSs with phylogenetic footprinting should reduce the number of false positive predictions. In this work, a hidden Markov model (HMM) has been developed that combines these two independent methods for the prediction of TFBSs in a synergistic way.
The HMM has been parameterized according to the insights gained about the differing conservation of the TFBSs of particular TFs. On the test data sets investigated, the HMM made more accurate predictions than a purely PSSM-based search for TFBSs. In certain cases, the number of false positive predictions was reduced to a quarter at a given sensitivity. The prediction of TFBSs with this highly accurate method provides a foundation for the construction of gene regulatory networks and thus for a better understanding of the regulation of transcription within a cell.
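
The idea of combining a PSSM-based search with phylogenetic footprinting can be illustrated with a minimal Python sketch. It is not the HMM developed in this work: it simply keeps a PSSM hit only if the corresponding alignment columns are sufficiently conserved. The function names, the toy count matrix, the toy pairwise alignment, the uniform background, and both thresholds are assumptions chosen for illustration only.

```python
import math

# Assumption: uniform background nucleotide frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def pssm_from_counts(counts, pseudocount=1.0):
    """Turn per-position base counts into log2-odds scores against BACKGROUND."""
    pssm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in "ACGT") + 4 * pseudocount
        pssm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / BACKGROUND[b])
            for b in "ACGT"
        })
    return pssm


def scan(sequence, pssm):
    """Yield (start, score) for every window of the ungapped sequence."""
    width = len(pssm)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if all(base in BACKGROUND for base in window):
            yield start, sum(col[base] for col, base in zip(pssm, window))


def conservation(aligned_ref, aligned_other, start, width):
    """Fraction of the `width` reference bases beginning at ungapped position
    `start` that are aligned to an identical base in the other species."""
    matches = 0
    ref_pos = -1
    for ref_base, other_base in zip(aligned_ref, aligned_other):
        if ref_base == "-":
            continue  # gap in the reference: no reference position consumed
        ref_pos += 1
        if ref_pos < start:
            continue
        if ref_pos >= start + width:
            break
        if ref_base == other_base:
            matches += 1
    return matches / width


def predict_sites(aligned_ref, aligned_other, pssm, min_score, min_conservation):
    """Keep PSSM hits in the reference that also pass the conservation filter."""
    reference = aligned_ref.replace("-", "")
    hits = []
    for start, score in scan(reference, pssm):
        if score < min_score:
            continue
        cons = conservation(aligned_ref, aligned_other, start, len(pssm))
        if cons >= min_conservation:
            hits.append((start, score, cons))
    return hits


# Hypothetical toy data: a 7-bp motif built from 8 identical example sites,
# and a made-up pairwise alignment (reference first, second species below).
TOY_COUNTS = [{base: 8} for base in "TGACTCA"]
ALIGNED_REF = "ACGTTGACTCAGG-TTGACTCATT"
ALIGNED_OTHER = "ACGATGACTCAGGCTTCAGTAATT"

pssm = pssm_from_counts(TOY_COUNTS)
hits = predict_sites(ALIGNED_REF, ALIGNED_OTHER, pssm,
                     min_score=8.0, min_conservation=0.8)
for start, score, cons in hits:
    print(f"start={start} score={score:.2f} conservation={cons:.2f}")
```

In this toy alignment the PSSM alone reports two perfect matches in the reference sequence, but only the first lies in a conserved block, so the filter discards the second. A hard threshold of this kind ignores the TF-specific conservation differences described above, which is what a probabilistic model such as an HMM can take into account.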
