Abstract
Background: The identification of protein domains plays an important role in protein structure comparison. Domain query size and composition are critical to structure similarity search algorithms such as the Vector Alignment Search Tool (VAST), the method employed for computing related protein structures in the NCBI Entrez system. Currently, domains identified on the basis of structural compactness are used for VAST computations. In this study, we have investigated how alternative definitions of domains derived from conserved sequence alignments in the Conserved Domain Database (CDD) would affect the domain comparisons and structure similarity search performance of VAST.

Results: Alternative domains, which have significantly different secondary structure composition from those based on structurally compact units, were identified based on the alignment footprints of curated protein sequence domain families. Our analysis indicates that domain boundaries disagree on roughly 8% of protein chains in the medium redundancy subset of the Molecular Modeling Database (MMDB). These conflicting sequence-based domain boundaries perform slightly better than structure domains in structure similarity searches, and there are interesting cases in which structure similarity search performance is markedly improved.

Conclusion: Structure similarity searches using domain boundaries based on conserved sequence information can provide an additional method for investigators to identify interesting similarities between proteins with known structures. Because of the improvement in performance of structure similarity searches using sequence domain boundaries, we are in the process of implementing their inclusion into the VAST search and MMDB resources in the NCBI Entrez system.
Highlights
The identification of protein domains plays an important role in protein structure comparison
Comparison of structure similarity search results: Having tested a series of thresholds for identification of differences between sequence- and structure-based domains as described in the methods section, we focused on the structure similarity search results using sequence-based domains for which at least 90% of the domain consensus sequence was aligned to a structure
These results are derived from applying the methods to the medium redundancy subset of the Molecular Modeling Database (MMDB), in which the structure database is reduced by clustering similar structures based on sequence similarity and selecting a single representative from each cluster based on structure quality [30]
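The 90% consensus-coverage criterion mentioned in the highlights can be illustrated with a minimal sketch. This is not the authors' implementation: the `DomainHit` record, its field names, and the helper functions are hypothetical, and the sketch assumes that coverage is simply the fraction of consensus positions aligned to the structure chain.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class DomainHit:
    """Hypothetical record of one sequence-domain footprint on a structure chain."""
    chain_id: str
    domain_id: str
    consensus_length: int   # number of positions in the domain consensus sequence
    aligned_positions: int  # consensus positions aligned to the chain


def consensus_coverage(hit: DomainHit) -> float:
    """Fraction of the domain consensus sequence aligned to the structure."""
    if hit.consensus_length <= 0:
        raise ValueError("consensus_length must be positive")
    return hit.aligned_positions / hit.consensus_length


def select_hits(hits: Iterable[DomainHit], min_coverage: float = 0.90) -> List[DomainHit]:
    """Keep only hits whose consensus coverage meets the threshold (assumed 90%)."""
    return [h for h in hits if consensus_coverage(h) >= min_coverage]


# Illustrative, made-up values; not data from the study.
example_hits = [
    DomainHit("chainA", "domain1", consensus_length=200, aligned_positions=195),
    DomainHit("chainB", "domain2", consensus_length=150, aligned_positions=100),
]
selected = select_hits(example_hits)
```

With these made-up inputs, only the first hit passes the filter, since 195 of 200 consensus positions are aligned while the second hit covers only 100 of 150.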
Summary
The identification of protein domains plays an important role in protein structure comparison. Domain query size and composition are critical to structure similarity search algorithms such as the Vector Alignment Search Tool (VAST), the method employed for computing related protein structures in the NCBI Entrez system. We have investigated how alternative definitions of domains derived from conserved sequence alignments in the Conserved Domain Database (CDD) would affect the domain comparisons and structure similarity search performance of VAST. A common method for identifying similarity between proteins is the use of sequence alignment tools. While the performance of the available methods and databases is for the most part satisfactory, it is not unusual for such methods to miss certain biologically related protein structures that may be identified by human inspection. Although initial reflection on the two possibilities may suggest that the first would be the most fruitful, there is a great deal that may be done with the data itself