Automated hierarchical classification of protein domain subfamilies based on functionally-divergent residue signatures

Andrew F Neuwald, Christopher J Lanczycki, Aron Marchler-Bauer

doi:10.1186/1471-2105-13-144

Abstract

Background: The NCBI Conserved Domain Database (CDD) consists of a collection of multiple sequence alignments of protein domains that are at various stages of being manually curated into evolutionary hierarchies based on conserved and divergent sequence and structural features. These domain models are annotated to provide insights into the relationships between sequence, structure and function via web-based BLAST searches.

Results: Here we automate the generation of conserved domain (CD) hierarchies using a combination of heuristic and Markov chain Monte Carlo (MCMC) sampling procedures, starting from a (typically very large) multiple sequence alignment. This procedure relies on statistical criteria to define each hierarchy based on the conserved and divergent sequence patterns associated with protein functional specialization. At the same time, it facilitates the sequence and structural annotation of functionally important residues. These statistical criteria also provide a means to objectively assess the quality of CD hierarchies, a non-trivial task considering that the protein subgroups are often very distantly related, a situation in which standard phylogenetic methods can be unreliable. Our aim here is to automatically generate (typically sub-optimal) hierarchies that, based on statistical criteria and visual comparisons, are comparable to manually curated hierarchies; this serves as the first step toward the ultimate goal of obtaining optimal hierarchical classifications. A plot of runtimes for the most time-intensive (non-parallelizable) part of the algorithm indicates a nearly linear time complexity so that, even for the extremely large Rossmann fold protein class, results were obtained in about a day.

Conclusions: This approach automates the rapid creation of protein domain hierarchies and thus will eliminate one of the most time-consuming aspects of conserved domain database curation. At the same time, it also facilitates protein domain annotation by identifying those pattern residues that most distinguish each protein domain subgroup from other related subgroups.

Highlights

The National Center for Biotechnology Information (NCBI) Conserved Domain Database (CDD) consists of a collection of multiple sequence alignments of protein domains that are at various stages of being manually curated into evolutionary hierarchies based on conserved and divergent sequence and structural features.
In order to provide rapid and sensitive annotation for protein sequences, including direct links to structural and functional information, the National Center for Biotechnology Information (NCBI) initiated the Conserved Domain Database (CDD) [1], a collection of position-specific scoring matrices (PSSMs) that are derived from protein multiple sequence alignments.
In Results and Discussion, we lay out the basic automated mcBPPS (amcBPPS) algorithm, illustrate an implementation of the algorithm as applied to P-loop GTPases, compare its performance against various manually curated conserved domain (CD) hierarchies, further evaluate its performance using both delete-half jackknifing and simulations, and apply it to several large protein classes for which existing hierarchies or alignments are currently unavailable.
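
The delete-half jackknifing mentioned above can be illustrated with a short sketch. This summary does not give the details of the authors' procedure, so the following is only one plausible reading: draw random subsets containing half of the sequences, rerun the classification on each subset, and measure how consistently pairs of sequences are grouped. The function names `delete_half_replicates` and `pair_agreement` are hypothetical, and the pairwise agreement score is an assumption chosen for illustration, not the statistic used in the paper.

```python
import random
from itertools import combinations


def delete_half_replicates(sequence_ids, n_replicates=10, seed=0):
    """Return n_replicates random subsets, each holding half of the input ids.

    Hypothetical helper: sampling is without replacement within a replicate.
    """
    rng = random.Random(seed)
    ids = list(sequence_ids)
    half = len(ids) // 2
    return [rng.sample(ids, half) for _ in range(n_replicates)]


def pair_agreement(labels_a, labels_b):
    """Fraction of sequence pairs grouped consistently by two classifications.

    labels_a and labels_b map sequence id -> subgroup label. Only ids present
    in both are compared. A pair agrees when both classifications place the
    two sequences together, or both place them apart, so the score does not
    depend on how subgroups happen to be named in each run.
    """
    shared = sorted(set(labels_a) & set(labels_b))
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 1.0  # nothing to disagree about
    agree = sum(
        (labels_a[x] == labels_a[y]) == (labels_b[x] == labels_b[y])
        for x, y in pairs
    )
    return agree / len(pairs)
```

In use, each replicate would be passed to whatever classifier is being evaluated (not shown here), and the resulting subgroup labels compared against the full-data labels with `pair_agreement`; scores near 1 would suggest that the subgroup assignments are stable when half of the data are removed.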

Summary

Introduction

The NCBI Conserved Domain Database (CDD) consists of a collection of multiple sequence alignments of protein domains that are at various stages of being manually curated into evolutionary hierarchies based on conserved and divergent sequence and structural features. These domain models are annotated to provide insights into the relationships between sequence, structure and function via web-based BLAST searches. Our focus here is to obtain heuristically an initial (presumably sub-optimal) CD hierarchy starting from a typically very large multiple sequence alignment for an entire protein class whose domain boundaries remain fixed. To do this, we utilize procedures that obtain both subgroup assignments for aligned sequences and corresponding discriminating patterns associated with protein functional divergence.
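
To make the idea of a discriminating pattern concrete, here is a minimal sketch that scores each alignment column by how strongly a subgroup's residue composition diverges from that of the remaining sequences. It is not the BPPS sampler described in the paper: the relative-entropy score, the flat pseudocount, and the function name `column_divergence` are all assumptions made for illustration.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_divergence(foreground, background, pseudocount=1.0):
    """Score each column by the relative entropy of foreground vs background.

    foreground and background are lists of equal-length aligned sequences.
    Gap characters and other non-standard symbols are ignored. A flat
    pseudocount keeps every frequency non-zero. Higher scores mark columns
    whose residues best distinguish the foreground subgroup.
    """
    n_cols = len(foreground[0])
    scores = []
    for col in range(n_cols):
        fg = Counter(s[col] for s in foreground if s[col] in AMINO_ACIDS)
        bg = Counter(s[col] for s in background if s[col] in AMINO_ACIDS)
        fg_total = sum(fg.values()) + pseudocount * len(AMINO_ACIDS)
        bg_total = sum(bg.values()) + pseudocount * len(AMINO_ACIDS)
        score = 0.0
        for aa in AMINO_ACIDS:
            p = (fg[aa] + pseudocount) / fg_total  # foreground frequency
            q = (bg[aa] + pseudocount) / bg_total  # background frequency
            score += p * log2(p / q)
        scores.append(score)
    return scores
```

For a toy foreground of `["AKG", "AKG", "AKG"]` against a background of `["ARG", "ADG", "AEG", "AQG"]`, the middle column receives the highest score, because the subgroup conserves a lysine where the background varies, while the two columns shared by both sets score close to zero.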

Journal: BMC Bioinformatics	Publication Date: Jun 22, 2012
Citations: 11	License type: CC BY 2.0

Similar Papers

Abstract 1090: Evolutionary, structural, and functional insights into the seven-transmembrane GPCR superfamily through NCBI's Conserved Domain Database
James S Song ... Noreen R Gonzales
Cancer Research | VOL. 75
01 Aug 2015

Abstract 2900: Phylogenetic classification, structural evolution, and functional divergence of the Ras superfamily of GTPases as tracked by the Conserved Domain Database (CDD).
James S Song ... Myra K Derbyshire
Cancer Research | VOL. 73
15 Apr 2013

Abstract 95: Phylogenetic classification, structural evolution, and functional divergence of the ABC-type transporters: Inferences derived from CDTree/Cn3D software applications
James S Song ... Aron Marchler-Bauer
Cancer Research | VOL. 70
15 Apr 2010

The conserved domain database in 2023.
Jiyao Wang ... Shennan Lu
Nucleic Acids Research | VOL. 51
08 Dec 2022