Gene Prediction with a Hidden Markov Model

Mario Stanke

doi:10.53846/goediss-2451

Abstract

Annotation of the large and rapidly increasing amount of genomic sequence data requires computational tools for finding genes in DNA sequences. This thesis presents a computational method for finding protein-coding genes encoded in DNA sequences of eukaryotes (e.g. plants and animals). We introduce a so-called generalized Hidden Markov Model (GHMM) for eukaryotic genomic sequences.

This model, called AUGUSTUS, is a probabilistic model of a DNA sequence together with the gene structure underlying the sequence. It defines a probability distribution on the set of all possible pairs of a DNA sequence and its annotation of protein-coding regions. Genes in an input DNA sequence can be uncovered by finding the gene structure that is most likely in the probabilistic model given the input DNA sequence. A computer program searches for this most likely gene structure, which can be done both exactly and efficiently because of the relatively simple dependency structure of the distribution defined by a GHMM. Several new methods have been applied so that the model fits the actual distribution of true sequences and their annotations well. A GHMM for gene prediction contains probabilistic state models for different functional parts of the genomic sequence, such as translational and splicing signals and coding regions. For the splice sites, new probabilistic submodels are introduced. A method is used to better estimate the parameters of the model depending on the average base frequency.

Further, the following issue is addressed. A GHMM may model the length distribution of certain structural parts of the sequence, such as introns. The disadvantage of the procedures used in existing programs was that they either caused prohibitively long running times or modeled the true length distribution inadequately.
An approach presented here makes it possible to approximate a given length distribution by an arbitrary initial part and a geometric tail at relatively low computational cost.

A computer program based on this model has been tested on DNA sequences with known annotation from human and the fruit fly Drosophila melanogaster. The accuracy of the predictions compares favorably to that of other well-known, established gene prediction programs.

The second major part of the thesis addresses uncertain external information about the gene structure and presents a method for integrating such information into a GHMM for gene prediction. The GHMM AUGUSTUS is extended to a new GHMM, AUGUSTUS+, which is a probabilistic model of all possible triplets formed by a DNA sequence, its annotation, and external information about the sequence.

The gene prediction program then finds the most likely annotation given both the input DNA sequence and the external evidence. It accounts for the fact that such external evidence can be misleading. The parameters corresponding to the distribution of the external information can be easily estimated. This leads to a naturally justified increase in the likelihood of gene structures that respect the external information, compared to their likelihood in the previous model. The method makes it possible to use evidence about a range of the DNA sequence, e.g. evidence that a certain range of the sequence is protein-coding, without preferring gene structures that only "partially respect" that evidence over those that do not respect it at all. Another advantage of the method presented here, compared to ad hoc methods for integrating external information into existing programs, is that the underlying theory of GHMMs still applies to the model.

Experiments with AUGUSTUS+ show that the use of extrinsic information coming from EST database searches, when combined with protein database searches, can significantly improve the prediction accuracy of gene prediction programs.
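
The idea of a length distribution with "an arbitrary initial part and a geometric tail" can be illustrated with a minimal sketch. It is not taken from the thesis: the function name `make_length_distribution`, the parameter `q`, and the example numbers are assumptions chosen for illustration. Lengths up to a cutoff d get freely chosen probabilities, and the remaining probability mass is spread over longer lengths so that each additional base multiplies the probability by a constant factor q. A geometric tail of this kind is what a state with a fixed self-transition probability produces, which is one plausible reason it is cheap to handle.

```python
from typing import Callable, Sequence


def make_length_distribution(head: Sequence[float], q: float) -> Callable[[int], float]:
    """Build P(length = l) for l >= 1 (illustrative, hypothetical helper).

    head[i] is the probability of length i + 1 (the arbitrary initial part).
    The remaining mass 1 - sum(head) is distributed over lengths greater
    than len(head) with a geometric decay of ratio q.
    """
    head_mass = sum(head)
    if any(p < 0.0 for p in head) or head_mass > 1.0 + 1e-12:
        raise ValueError("head must hold non-negative values summing to at most 1")
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")

    d = len(head)                          # cutoff between head and tail
    tail_mass = max(0.0, 1.0 - head_mass)  # mass left for lengths > d

    def prob(length: int) -> float:
        if length < 1:
            return 0.0
        if length <= d:
            return head[length - 1]
        # Geometric tail: (1 - q) * q**k sums to 1 over k = 0, 1, 2, ...
        return tail_mass * (1.0 - q) * q ** (length - d - 1)

    return prob


if __name__ == "__main__":
    # Assumed example values, not estimates from real intron data.
    intron_length = make_length_distribution([0.1, 0.3, 0.2], q=0.5)
    for length in range(1, 8):
        print(length, intron_length(length))
```

In this sketch only the first d probabilities are stored explicitly, so the cost of representing the distribution does not grow with the longest intron in the data.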

Full Text