Abstract

We review recent developments in spaced seed design for cross-species sequence alignment. We start with a brief overview of original ideas and early techniques, and then focus on more recent work on finding accurate (sensitive and specific) seeds for cross-species cDNA-to-genome alignment. These recent developments include methods and models for estimating seed specificity and determining sensitive and specific seeds, finding seeds that can be applied to a wide range of comparisons, and applying seed models to other computational biology areas, such as gene finding.

1. Introduction

New high-throughput and cost-effective technologies have revolutionized our ability to sequence complex organisms, and are expected to lead to a significant increase in the number of available genomes for species from all branches of life (1). The first and most important step in analyzing these genomes is gene annotation, that is, accurately identifying the locations and exon-intron structures of genes along the genome, and further determining their function. There are two primary classes of methods for identifying genes in a given genomic sequence. The first class, ab initio methods (GenScan (2), Genie (3), GeneMark (4), FGenesH (5)), use machine-learning techniques to analyze a single genomic sequence and predict the locations of genes. Such methods are reasonably accurate at finding coding exons, but are not effective at detecting untranslated regions (UTRs) and alternatively spliced or overlapping genes (6). The second class, comparative methods, predict exons based on sequence similarity of protein or expressed DNA (cDNA, EST, mRNA) with genomic sequences containing those genes. These methods are the most reliable for inferring the gene structure, and thus genome annotation projects have routinely used cDNA sequences from the same species to annotate genes.
Although several projects exist that produce full-length cDNA sequences (7-9), they focus on a handful of high-priority species, such as human, mouse, rat, cow and zebrafish. For most newly sequenced species, few native cDNA sequences are available in the databases. Consequently, gene annotation
