Computational Gene Identification: Under the Hood

James W Fickett

doi:10.1016/b978-155938979-2/50007-7

Abstract

This chapter reviews, for the scientific layman, the computational techniques used to identify genes in DNA sequences, and describes the working principles, capabilities, and limitations of gene identification software. Some attention is also given to likely future developments. The emphasis is on eukaryotes, where the problem is both most interesting and most difficult. Two types of computational analysis are normally performed on essentially every newly determined DNA sequence. The first is a database search that compares the new sequence with existing collections (of nucleotide sequences, amino acid sequences, or motifs). The second, the topic of this chapter, is a search for protein-coding regions, or genes. The chapter describes the three primary means of gathering clues about the existence, location, and function of genes: database similarity search, statistical regularities of coding regions, and pattern recognition of functional sites. The purpose of the review is to give an overview of these techniques for readers who would like to understand, at a high level, how computational gene identification is done.

Full Text