Determining the functional sequence content of an organism is confounded by several factors. Large protein-coding sequences are relatively easy to find by statistical methods. Smaller proteins, however, may escape detection because their size falls below an arbitrary researcher-defined minimum cutoff, or because a promoter or translational start cannot be precisely defined (Delcher et al., Nucleic Acids Res., 27, 4636-4641, 1999). Promoter and regulatory sequences are themselves difficult to define because of the significant amount of allowable sequence variation, as well as the probable lack of any completely accurate whole-organism gene catalog to date. Finally, certain genes coding for functional RNAs may have insufficient structural or sequence constraints to be detectable by normal sequence structure/pattern searching methods (Eddy and Rivas, Bioinformatics, 16, 583-605, 2000). Where multiple closely related organisms have been sequenced, additional information is available for investigating sequence content: the possible conservation of functional sequences between the organisms. We present a method that uses this conservation to detect genes and other potentially functional sequences that may be missed by standard ORF-calling, RNA-finding, and pattern-matching software. The tricross programs produce a multi-way cross comparison of three sets of sequences, determine which sequences are conserved in all three sets, and produce a graphical representation (Virtual Reality Modelling Language, VRML; ISO/IEC 14772-1:1997, VDC, 1997) as well as alignments of all sequence triples found. The software can also be applied to a pair of sequence sets, though the noise in the results increases.
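The three-way conservation test can be illustrated with a minimal sketch. It is not the tricross implementation: the abstract does not specify the similarity measure, so the sketch assumes a shared k-mer (Jaccard) score as a crude stand-in for an alignment-based score, and the function names, threshold, and example sequences are all hypothetical. A triple is reported only when every one of its three pairwise comparisons passes the threshold.

```python
from itertools import product

def kmers(seq, k=8):
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=8):
    """Jaccard index of the two k-mer sets (assumed stand-in for an alignment score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

def pair_hits(set_x, set_y, threshold, k):
    """Return the (name_x, name_y) pairs whose similarity meets the threshold."""
    return {
        (x, y)
        for x, y in product(set_x, set_y)
        if similarity(set_x[x], set_y[y], k) >= threshold
    }

def conserved_triples(set_a, set_b, set_c, threshold=0.5, k=8):
    """Return (a, b, c) name triples for which all three pairwise comparisons hit."""
    ab = pair_hits(set_a, set_b, threshold, k)
    bc = pair_hits(set_b, set_c, threshold, k)
    ac = pair_hits(set_a, set_c, threshold, k)
    return [
        (a, b, c)
        for (a, b) in sorted(ab)
        for c in sorted(set_c)
        if (b, c) in bc and (a, c) in ac
    ]

# Hypothetical toy data: one sequence conserved in all three sets
# (with a single substitution in set B), and one conserved only in A and B.
core = "ATGCGTACGTTAGCCTAGGCTA"
set_a = {"a1": core, "a2": "GGGATTTCCCAAAGGGTTT"}
set_b = {"b1": "ATGCGTACGTCAGCCTAGGCTA", "b2": "GGGATTTCCCAAAGGGTTT"}
set_c = {"c1": core, "c2": "CCCCCCCCCCCCCCCCCCCC"}

triples = conserved_triples(set_a, set_b, set_c, threshold=0.5, k=4)
print(triples)
```

In this toy input the pair a2/b2 matches perfectly but has no partner in the third set, so it is excluded. Dropping the third set would let such pairs through, which is consistent with the remark that pairwise comparison yields noisier results.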
Tricross has been used to examine the intergenic sequence content of the three archaeal Pyrococcus genomes to determine the most highly related sequences remaining between the annotated protein- and RNA-coding sequences. With relatively stringent similarity requirements set for the search, tricross found 101 intergenic sequences conserved among the three organisms. Interestingly, 29 of these appear to contain members of a family of small RNA molecules (Kiss-Laszlo et al., EMBO J., 17, 797-807, 1998) only recently discovered in the Archaea (Armbruster, OSU, Diss., 1988; Omer et al., Science, 288, 517-522, 2000; Gaspin et al., J. Mol. Biol., 297, 895-906, 2000). While some of the remaining 72 appear to be individual highly conserved promoter sequences, others have no currently known biological significance. Although tricross was originally developed to facilitate the examination of intergenic sequences, none of its logic is inherently specific to them. The software can also be applied to gene sequences, and has been used to produce inter-genomic gene order dot-plots for Haemophilus influenzae (Fleischmann et al., Science, 269, 496-512, 1995) versus H. ducreyi (unpublished data), and for Neisseria meningitidis Z2491 (serogroup A) (Parkhill et al., Nature, 404, 502-506, 2000) versus Neisseria meningitidis MC58 (serogroup B) (Tettelin et al., Science, 287, 1809-1815, 2000) versus Neisseria gonorrhoeae (Lewis et al., http://micro-gen.ouhsc.edu/, 2000). The tricross software package is available from http://www.biosci.ohio-state.edu/~ray/bioinformatics/tricross.html. Contact: ray@biosci.ohio-state.edu; daniels.7@osu.edu; munsonr@pediatrics.ohio-state.edu. Additional data from the cross-genomic comparisons examined in the discussion section are linked from http://www.biosci.ohio-state.edu/~ray/bioinformatics/tricross.html.