Identification and distribution of protein families in 120 completed genomes using Gene3D

David Lee, Christine Orengo, Alastair Grant, Russell L Marsden

doi:10.1002/prot.20409

Abstract

Using a new protocol, PFscape, we undertake a systematic identification of protein families and domain architectures in 120 complete genomes. PFscape clusters sequences into protein families using a Markov clustering algorithm (Enright et al., Nucleic Acids Res 2002;30:1575-1584) followed by complete linkage clustering according to sequence identity. Within each protein family, domains are recognized using a library of hidden Markov models comprising CATH structural and Pfam functional domains. Domain architectures are then determined using DomainFinder (Pearl et al., Protein Sci 2002;11:233-244) and the protein family and domain architecture data are amalgamated in the Gene3D database (Buchan et al., Genome Res 2002;12:503-514). Using Gene3D, we have investigated protein sequence space, the extent of structural annotation, and the distribution of different domain architectures in completed genomes from all kingdoms of life. As with earlier studies by other researchers, the distribution of domain families shows power-law behavior such that the largest 2,000 domain families can be mapped to approximately 70% of nonsingleton genome sequences; the remaining sequences are assigned to much smaller families. While approximately 50% of domain annotations within a genome are assigned to 219 universal domain families, a much smaller proportion (<10%) of protein sequences are assigned to universal protein families. This supports the mosaic theory of evolution whereby domain duplication followed by domain shuffling gives rise to novel domain architectures that can expand the protein functional repertoire of an organism. Functional data (e.g., COG/KEGG/GO) integrated within Gene3D result in a comprehensive resource that is currently being used in structural genomics initiatives and can be accessed via http://www.biochem.ucl.ac.uk/bsm/cath/Gene3D/.
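
The second clustering stage mentioned in the abstract, complete linkage clustering according to sequence identity, can be illustrated with a small sketch. This is not the PFscape implementation: the function name `complete_linkage`, the toy identity values, and the thresholds (35 and 50) are all assumptions chosen for illustration, since the abstract does not state the identity levels used. The sketch only shows the general technique, in which two clusters merge only if every cross-cluster pair of sequences reaches the identity threshold.

```python
from itertools import combinations


def complete_linkage(ids, identity, threshold):
    """Agglomerative complete-linkage clustering on pairwise identities.

    ids:       list of sequence identifiers
    identity:  dict mapping frozenset({a, b}) -> percent identity
               (pairs missing from the dict are treated as 0)
    threshold: identity that every cross-cluster pair must reach
               for two clusters to be merged (hypothetical value)
    """
    clusters = [frozenset([i]) for i in ids]

    def linkage(c1, c2):
        # Complete linkage: score two clusters by their WORST pair.
        return min(identity.get(frozenset((a, b)), 0.0)
                   for a in c1 for b in c2)

    while len(clusters) > 1:
        # Pick the pair of clusters with the highest worst-pair identity.
        best = max(combinations(clusters, 2), key=lambda p: linkage(*p))
        if linkage(*best) < threshold:
            break  # no remaining pair satisfies the threshold
        clusters = [c for c in clusters if c not in best]
        clusters.append(best[0] | best[1])
    return sorted(sorted(c) for c in clusters)


# Toy percent identities between four hypothetical sequences.
toy_identity = {
    frozenset(("A", "B")): 90.0,
    frozenset(("B", "C")): 85.0,
    frozenset(("A", "C")): 40.0,
    frozenset(("C", "D")): 35.0,
    frozenset(("B", "D")): 20.0,
    frozenset(("A", "D")): 10.0,
}

loose = complete_linkage(["A", "B", "C", "D"], toy_identity, 35.0)
strict = complete_linkage(["A", "B", "C", "D"], toy_identity, 50.0)
print(loose)   # [['A', 'B', 'C'], ['D']]
print(strict)  # [['A', 'B'], ['C'], ['D']]
```

At the stricter threshold, C stays separate from A and B even though B and C are 85% identical, because the A-C pair (40%) falls below the cutoff. That is the property distinguishing complete linkage from single linkage, which would have merged them.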

Journal: Proteins: Structure, Function, and Bioinformatics
Publication Date: Mar 14, 2005
Citations: 41

Similar Papers

Reassessing Domain Architecture Evolution of Metazoan Proteins: Major Impact of Errors Caused by Confusing Paralogs and Epaktologs
Alinda Nagy ... László Bányai
Genes | VOL. 2
02 Aug 2011

Evolution of Protein Domain Architectures
Sofia K Forslund ... Erik L L Sonnhammer
01 Jan 2019

Evolution of Protein Domain Architectures
Kristoffer Forslund ... Erik L L Sonnhammer
01 Jan 2012

Reassessing Domain Architecture Evolution of Metazoan Proteins: The Contribution of Different Evolutionary Mechanisms
Alinda Nagy ... Laszlo Patthy
Genes | VOL. 2
05 Aug 2011