Abstract

Finding orthologous genes (similar genes in different genomes) is a fundamental problem in comparative genomics. We present a model for automatically extracting candidate ortholog clusters from a large set of genomes using a new clustering method for multipartite graphs. Groups of orthologous genes are found by focusing on gene similarities across genomes rather than on similarities between genes within a genome. The clustering problem is formulated as a series of combinatorial optimization problems whose solutions are interpreted as ortholog clusters. The objective function in each optimization problem is a quasi-concave set function, which can be maximized efficiently. We present the properties of these functions and the algorithm that maximizes them. We applied our method to find ortholog clusters in the data underlying the manually curated Clusters of Orthologous Genes (COG) database, comprising 108,090 sequences from 43 genomes. We validated the candidate ortholog clusters by comparing them against the manually curated ortholog clusters in COG and by verifying annotations in Pfam and SCOP; in most cases the clusters correlate strongly with the known results. An analysis of Pfam and SCOP annotations and COG membership for the sequences in the 7,701 clusters that include sequences from at least three organisms shows that 7,474 (97%) of these clusters contain sequences that are all consistent in at least one of the annotations or in their COG membership.
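
To make the optimization step concrete, the following is a minimal sketch, not the method as specified in the paper. It assumes one common form of quasi-concave set function, F(H) = min over i in H of pi(i, H), where the linkage pi(i, H) is taken here to be the sum of similarities between gene i and the genes of H that lie in other genomes. Under that assumption pi can only decrease as H shrinks, so repeatedly removing the gene with the weakest linkage and keeping the best set seen along the way yields a maximizer of F. The names `linkage`, `max_quasi_concave`, `extract_clusters`, `genome_of`, and `sim` are hypothetical, and the paper's actual linkage function may differ.

```python
def linkage(gene, subset, genome_of, sim):
    """Assumed linkage: total similarity from `gene` to members of `subset`
    in OTHER genomes (within-genome similarities are ignored)."""
    return sum(
        sim.get(frozenset((gene, other)), 0.0)
        for other in subset
        if genome_of[other] != genome_of[gene]
    )


def max_quasi_concave(genes, genome_of, sim):
    """Maximize F(H) = min_{i in H} linkage(i, H) by greedy peeling.

    Starting from all genes, remove the gene with the smallest linkage
    one at a time and remember the set with the largest minimum linkage.
    """
    current = set(genes)
    best_set, best_score = set(), float("-inf")
    while current:
        scores = {g: linkage(g, current, genome_of, sim) for g in current}
        # Sort first so ties are broken deterministically by gene name.
        weakest = min(sorted(current), key=scores.get)
        # Strict '>' keeps the largest set among equally scored ones.
        if scores[weakest] > best_score:
            best_score = scores[weakest]
            best_set = set(current)
        current.remove(weakest)
    return best_set, best_score


def extract_clusters(genes, genome_of, sim):
    """Solve a series of problems: take the best set, remove it, repeat.
    Stops when no remaining gene has any cross-genome similarity."""
    remaining = set(genes)
    clusters = []
    while remaining:
        cluster, score = max_quasi_concave(remaining, genome_of, sim)
        if score <= 0:
            break
        clusters.append(cluster)
        remaining -= cluster
    return clusters
```

In this sketch `sim` maps `frozenset({gene_a, gene_b})` to a similarity score and `genome_of` maps each gene to its genome. Recomputing every linkage after each removal keeps the code short but costs quadratic time per step; an efficient implementation would update the scores incrementally.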
