Abstract
Gene finding can be defined as the problem of identifying a stretch of genomic DNA sequence that is biologically functional. Such a stretch is known as a gene. A gene performs a molecular-level function, such as protein coding or regulation, and plays a biological role in processes such as growth, metabolism, and intelligence. Traditionally, gene finding relied on numerous biological experiments and statistical analyses to pinpoint the location of a new gene on a genetic map. With the advent of bioinformatics, gene finding has largely become a computational problem: genes can be predicted from the genomic sequence alone. However, determining the specific function and biological role of a gene still demands in vivo experimentation, which, it is hoped, new bioinformatics algorithms will reduce or even replace in the future. A newly sequenced genome is annotated thoroughly so that the information it carries can be utilized. In essence, genome annotation identifies the locations of genes and all coding regions in a genome and determines their protein products and functions. Hundreds of bacterial genome sequences are publicly available, and the number will soon reach a new milestone. Manual annotation cannot keep up with the deluge of new genome sequences appearing at this pace. The need for automated, large-scale, high-throughput genome annotation is urgent (Overbeek, Begley et al. 2005; Van Domselaar, Stothard et al. 2005; Stothard and Wishart 2006). The most basic level of genome annotation uses BLAST (Altschul, Gish et al. 1990) to find similarities between related genomic sequences. Integrating other sources of information and experimental data is a current trend in genome annotation.
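To make the idea of predicting genes from sequence alone concrete, the following is a minimal sketch of a naive open reading frame (ORF) scan. It is an illustration only, not the method developed in this work: the function name `find_orfs`, the `min_codons` threshold, the choice of ATG, GTG, and TTG as start codons, and the restriction to the forward strand are all assumptions made for the example.

```python
# Hypothetical illustration: naive ORF scan on the forward strand only.
# Not the authors' method; names and thresholds are assumptions.
START_CODONS = {"ATG", "GTG", "TTG"}  # start codons commonly seen in bacteria
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return a list of (start, end, frame) tuples.

    start is the 0-based index of the first in-frame start codon,
    end is exclusive and includes the stop codon,
    frame is 0, 1, or 2.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first start codon of the open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Keep the ORF only if it is long enough (stop codon counted).
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    for start, end, frame in find_orfs("CCATGAAATTTTAGGG"):
        print(start, end, frame)
```

Because the sketch always takes the first in-frame start codon, it reports the longest candidate for each stop codon, which is a simple heuristic and not necessarily the true translation start. Practical gene finders add statistical models and similarity evidence on top of such a scan.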
A recent study indicates that many genomes could be either over-annotated (too many genes) or under-annotated (too few genes), and that a large percentage of genes may have been assigned a wrong start codon (Nielsen and Krogh 2005). Even an annotation that is accurate and complete upon submission is not guaranteed to remain unchanged, because new experimental evidence and knowledge continue to arrive and make constant updates inevitable. However, re-annotation of a whole genome is not very fruitful, as most of the genes have already been identified in the first annotation. For example, the re-annotation of the H37Rv genome added only about 2% new protein-coding sequences (CDS) to the genome. This result reflects the limitations of current genome annotation technology. To address the issue, we developed a new method for gene finding in an annotated genome. We select the genome of Mycobacterium tuberculosis, the