A structural perspective on genome evolution

Juan Garcia Ranea, Alastair Grant, Mark Dibley, David Lee, Ian Sillitoe, Christine Orengo

doi:10.1145/974614.974658

Abstract

At UCL we have developed several automated protocols for generating protein family resources (CATH; Gene3D). These resources can be used to perform comparative genome analyses in order to understand the evolution of protein families, and to identify biologically and/or medically interesting families for which no structural data currently exist and which may therefore be important targets for structural genomics initiatives.

The CATH domain structure database, established by Orengo and Thornton in 1993, now contains a significant proportion of protein structures from the PDB clustered into 1400 evolutionary families. Relationships have been identified using robust structure comparison methods (SSAP, CATHEDRAL). We have also benchmarked and optimised various 1D-profile and HMM-based protocols for assigning genome sequences to families within the resource (e.g. SAM-T99, SAMOSA, CATH-ISL).

In this way we can assign structural data to a large proportion (up to 60%) of whole or partial sequences in completed genomes and to more than 80% of genes coding for enzymes and other proteins in biochemical pathways. However, in order to include all families regardless of whether their structure is known, a new protein family resource has been developed (Gene3D). In Gene3D, complete genes have been clustered according to sequence similarity alone, using a robust clustering method (Pfscape). In total, 120 completed genomes from all kingdoms have been clustered into 220,000 gene families, 70,000 of which contain two or more sequences. Subsequently, we have labelled those gene families for which CATH structural or Pfam functional domain annotations can be provided for all or part of the gene.

Preliminary analysis of the genome annotations reveals that a significant proportion (up to 70%) of CATH-annotated genes or gene regions in genomes are assigned to domain families that are common to all three kingdoms of life.
However, only 20% of the genome sequences are assigned to gene families common to all kingdoms. Since a large proportion of these genes are multidomain proteins, this supports the view that a great deal of functional diversity within the genomes has been achieved by combining domain modules in different ways.

In collaboration with Professor Janet Thornton, we have analysed a subset of 56 bacterial genomes to determine the recurrence of specific domain structure families within the genomes. This revealed a small but essential group of universal and, in some cases, highly recurring domain families. For some families (size-dependent), domain recurrence is highly correlated with increasing genome size, whilst for others (size-independent) no correlation is observed. Statistical analysis allowed us to distinguish three groups. Within the size-dependent families we differentiated two groups: linearly distributed and non-linearly distributed. Functional annotation using the COGs revealed that these domains were predominantly involved in metabolism and regulation, respectively. A third group, of evenly distributed, size-independent domains, is primarily involved in protein translation and biosynthesis.

By mapping CATH and Pfam domain families onto all the genome sequences in Gene3D, we observe that a few hundred highly recurrent families account for at least 50% of whole or partial genome sequences. Many of these families are common to both prokaryotes and eukaryotes and perform essential generic functions. In many of the largest families, significant divergence in sequence has been accompanied by modifications in structure and function. Targeting representatives in these families for structure determination will allow the structural genomics initiatives to map both fold and function space and reveal the mechanisms by which divergence in protein families promotes evolution of new functions.
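
As an illustration only, the three-way grouping of domain families by recurrence can be sketched in Python. The abstract does not say which statistical analysis was used, so everything below is an assumption: the Pearson threshold `r_min`, the log-log slope test with tolerance `slope_tol`, the function names, and the toy counts are hypothetical and are not taken from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. a power-law exponent.

    Requires strictly positive values, which holds for families that occur
    at least once in every genome.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    return sxy / sxx


def classify_family(genome_sizes, counts, r_min=0.7, slope_tol=0.25):
    """Assign one domain family to one of three groups.

    genome_sizes: genes per genome (one value per genome)
    counts: occurrences of the family in each genome (same order)
    r_min, slope_tol: hypothetical cut-offs chosen for this sketch
    """
    if pearson(genome_sizes, counts) < r_min:
        return "size-independent"
    exponent = loglog_slope(genome_sizes, counts)
    if abs(exponent - 1.0) <= slope_tol:
        return "size-dependent, linear"
    return "size-dependent, non-linear"


# Toy data (invented): four genomes and three families.
sizes = [1000, 2000, 4000, 8000]
families = {
    "family_A": [10, 20, 40, 80],  # grows in proportion to genome size
    "family_B": [1, 4, 16, 64],    # grows faster than genome size
    "family_C": [3, 2, 3, 2],      # roughly constant
}
for name, counts in families.items():
    print(name, classify_family(sizes, counts))
```

A log-log slope close to 1 indicates growth in proportion to genome size, while a clearly different slope indicates non-linear growth. A real analysis of 56 genomes would need a significance test rather than fixed cut-offs.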
