The LabelHash algorithm for substructure matching

Mark Moll, Drew H Bryant, Lydia E Kavraki

doi:10.1186/1471-2105-11-555

Abstract

Background: There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity.

Results: We present LabelHash, a novel algorithm for matching substructural motifs to large collections of protein structures. The algorithm consists of two phases. In the first phase the proteins are preprocessed in a fashion that allows for instant lookup of partial matches to any motif. In the second phase, partial matches for a given motif are expanded to complete matches. The general applicability of the algorithm is demonstrated with three different case studies. First, we show that we can accurately identify members of the enolase superfamily with a single motif. Next, we demonstrate how LabelHash can complement SOIPPA, an algorithm for motif identification and pairwise substructure alignment. Finally, a large collection of Catalytic Site Atlas motifs is used to benchmark the performance of the algorithm. LabelHash runs very efficiently in parallel; matching a motif against all proteins in the 95% sequence identity filtered non-redundant Protein Data Bank typically takes no more than a few minutes. The LabelHash algorithm is available through a web server and as a suite of standalone programs at http://labelhash.kavrakilab.org. The output of the LabelHash algorithm can be further analyzed with Chimera through a plugin that we developed for this purpose.

Conclusions: LabelHash is an efficient, versatile algorithm for large-scale substructure matching. When LabelHash is running in parallel, motifs can typically be matched against the entire PDB on the order of minutes. The algorithm is able to identify functional homologs beyond the twilight zone of sequence identity and even beyond fold similarity. The three case studies presented in this paper illustrate the versatility of the algorithm.

Highlights

There is an increasing number of proteins with known structure but unknown function
When LabelHash is running in parallel, motifs can typically be matched against the entire Protein Data Bank (PDB) on the order of minutes
The algorithm consists of two stages: a preprocessing stage and a stage where a motif is matched against the preprocessed data
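
The two stages named above can be illustrated with a toy sketch. This is not the LabelHash implementation: the residue representation, the function names (build_table, match_motif), the seed size k, the distance cutoff, and the use of a pairwise distance tolerance instead of a superposition-based score are all assumptions made for illustration. It only shows the general idea of indexing small labeled residue subsets once, then looking up a motif's seed and expanding each partial match to a complete one.

```python
from collections import defaultdict
from itertools import combinations, permutations
from math import dist

# Assumed toy representation: a residue is (label, (x, y, z)),
# a structure is a list of residues, a motif is a short list of residues.

def build_table(structures, k=2, max_dist=10.0):
    """Stage 1 (preprocessing): index every k-subset of mutually close
    residues by its sorted residue labels."""
    table = defaultdict(list)
    for name, residues in structures.items():
        for idxs in combinations(range(len(residues)), k):
            pts = [residues[i][1] for i in idxs]
            if all(dist(p, q) <= max_dist for p, q in combinations(pts, 2)):
                key = tuple(sorted(residues[i][0] for i in idxs))
                table[key].append((name, idxs))
    return table

def fits(motif, residues, assigned, cand, tol):
    """Can residue `cand` play the role of the next motif residue?
    Labels must agree and pairwise distances must match within `tol`."""
    m = len(assigned)
    if cand in assigned or residues[cand][0] != motif[m][0]:
        return False
    return all(
        abs(dist(motif[j][1], motif[m][1])
            - dist(residues[a][1], residues[cand][1])) <= tol
        for j, a in enumerate(assigned)
    )

def expand(motif, residues, assigned, tol):
    """Grow a partial match one motif residue at a time."""
    if len(assigned) == len(motif):
        yield tuple(assigned)
        return
    for cand in range(len(residues)):
        if fits(motif, residues, assigned, cand, tol):
            yield from expand(motif, residues, assigned + [cand], tol)

def match_motif(motif, structures, table, k=2, tol=1.0):
    """Stage 2 (matching): look up the motif's first k residues in the
    table, then expand every consistent partial match to a full match."""
    key = tuple(sorted(label for label, _ in motif[:k]))
    found = set()
    for name, idxs in table.get(key, []):
        residues = structures[name]
        for perm in permutations(idxs):
            seed_ok = all(
                fits(motif, residues, list(perm[:j]), perm[j], tol)
                for j in range(k)
            )
            if seed_ok:
                for full in expand(motif, residues, list(perm), tol):
                    found.add((name, full))
    return sorted(found)

# Made-up example data (not from the paper).
motif = [("HIS", (0, 0, 0)), ("ASP", (3, 0, 0)), ("SER", (0, 4, 0))]
structures = {
    "protA": [("GLY", (10, 10, 10)), ("HIS", (1, 1, 1)),
              ("ASP", (4, 1, 1)), ("SER", (1, 5, 1))],
    "protB": [("HIS", (0, 0, 0)), ("ASP", (3, 0, 0)), ("SER", (20, 20, 20))],
}
table = build_table(structures)
matches = match_motif(motif, structures, table)
print(matches)
```

In this sketch the table is built once and reused for every motif, which is what makes the lookup step cheap; only the expansion step touches the full structures. Both structures yield a partial match for the HIS/ASP seed, but only protA can be expanded to all three motif residues.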

Summary

Introduction

There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. A wide variety of substructure matching methods have been proposed, including TESS [20], SPASM [21], CavBase [22], eF-site [23], ASSAM [24], PINTS [25], Jess [26], SuMo [27], SiteEngine [28], Query D [29], ProFunc [30], ProKnow [31], SitesBase [32], GIRAF [33], MASH [34], SOIPPA [35,36], FEATURE [37], and pevoSOAR [38]. These methods differ mainly in (1) the representation of structural motifs, (2) the motif matching algorithm, and (3) the statistics used to determine the significance of a match. To assess the statistical significance of matches, Extreme Value Distributions [36,42], mixtures of Gaussians [26], and a non-parametric model [35,43] have been proposed.
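
As a rough illustration of what a distribution-free significance estimate can look like, the sketch below computes a generic empirical p-value. It is not the non-parametric model of the cited works: the function name, the assumption that a match is scored by a value where lower is better (for example an RMSD), the add-one correction, and the background scores are all hypothetical.

```python
def empirical_p_value(observed, background):
    """Fraction of background match scores at least as good as `observed`
    (lower is assumed to be better), with an add-one correction so the
    estimate is never exactly zero."""
    at_least_as_good = sum(1 for score in background if score <= observed)
    return (at_least_as_good + 1) / (len(background) + 1)

# Made-up background scores from matching a motif against unrelated structures.
background = [0.8, 1.2, 1.5, 2.0, 2.4, 3.1, 3.3, 3.9, 4.2]
p = empirical_p_value(1.0, background)
print(p)
```

A parametric alternative, such as fitting an extreme value distribution to the background scores, trades this simplicity for better resolution in the tail when few background matches are available.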

Journal: BMC Bioinformatics
Publication Date: Nov 11, 2010
Citations: 101
License type: CC BY 2.0
