Mining protein loops using a structural alphabet and statistical exceptionality

Leslie Regad, Anne-Claude Camproux, Gregory Nuel, Juliette Martin

doi:10.1186/1471-2105-11-75

Abstract

Background: Protein loops encompass 50% of protein residues in available three-dimensional structures. These regions are often involved in protein function, e.g. binding sites and catalytic pockets. However, describing protein loops with conventional tools is difficult. Regular secondary structures (helices and strands) have been widely studied, whereas loops are difficult to analyze because they are highly variable in sequence and structure. Because of data sparsity, long loops have rarely been studied systematically.

Results: We developed a simple and accurate method that describes and analyzes the structures of short and long loops using structural motifs, without restriction on loop length. The method is based on the structural alphabet HMM-SA, which simplifies a three-dimensional protein structure into a one-dimensional string of states, where each state is a four-residue prototype fragment called a structural letter. The difficult task of structurally grouping huge data sets is thus accomplished by handling structural-letter strings as in conventional protein sequence analysis. We systematically extracted all seven-residue fragments from a bank of 93000 protein loops and grouped them according to their structural-letter sequence, called a structural word. This approach permits a systematic analysis of loops of all sizes, since we consider structural motifs of seven residues rather than complete loops. We focused the analysis on highly recurrent words of loops (observed more than 30 times). Our study reveals that 73% of loop lengths are covered by only 3310 highly recurrent structural words (out of 28274 observed words). These structural words have low structural variability (mean RMSd of 0.85 Å). As expected, half of these motifs display a flanking-region preference but, interestingly, two thirds are shared by short (fewer than 12 residues) and long loops. Moreover, half of the recurrent motifs exhibit a significant level of amino-acid conservation, with at least four significant positions, and 87% of long loops contain at least one such word. We complement our analysis with the detection of statistically over-represented patterns of structural letters, as in conventional DNA sequence analysis. About 30% (930) of structural words are over-represented and cover about 40% of loop lengths. Interestingly, these words exhibit lower structural variability and higher sequence specificity, suggesting structural or functional constraints.

Conclusions: We developed a method to systematically decompose and study protein loops using recurrent structural motifs. This method is based on the structural alphabet HMM-SA and not on structural alignment and geometrical parameters. We extracted meaningful structural motifs that are found in both short and long loops. To our knowledge, this is the first time that pattern mining has helped to increase the signal-to-noise ratio in protein loops. This finding helps to better describe protein loops and might reduce the complexity of long-loop analysis. Detailed results are available at http://www.mti.univ-paris-diderot.fr/publication/supplementary/2009/ACCLoop/.
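The idea of scoring a structural word for over-representation can be illustrated with a deliberately simplified sketch. It is not the statistic used in the paper: it assumes that structural letters occur independently (a zero-order model) and uses a binomial z-score, which ignores the dependence between overlapping occurrences. The function name `overrepresentation_z` and the toy letter strings are hypothetical.

```python
import math
from collections import Counter


def overrepresentation_z(encoded_loops, word):
    """Binomial z-score of `word` under an independent-letter model.

    Simplifying assumption: letters are drawn independently with their
    observed frequencies, so the expected count is (number of word
    positions) * (product of letter frequencies).
    """
    k = len(word)
    letters = Counter("".join(encoded_loops))
    total_letters = sum(letters.values())
    # Probability of the word at any given position under independence.
    p_word = math.prod(letters[c] / total_letters for c in word)
    # Number of positions where a k-letter word can start.
    n_positions = sum(max(len(s) - k + 1, 0) for s in encoded_loops)
    observed = sum(
        1
        for s in encoded_loops
        for i in range(len(s) - k + 1)
        if s[i:i + k] == word
    )
    expected = n_positions * p_word
    variance = expected * (1.0 - p_word)
    if variance == 0:
        return 0.0
    return (observed - expected) / math.sqrt(variance)


# Toy data: "AB" occurs far more often than independence would predict.
toy_loops = ["ABABAB"] * 5
z_ab = overrepresentation_z(toy_loops, "AB")
z_aa = overrepresentation_z(toy_loops, "AA")
```

A large positive score marks a word seen more often than the background model predicts. A realistic analysis would need a background model that accounts for dependencies between consecutive structural letters and for overlapping occurrences.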

Highlights

Protein loops encompass 50% of protein residues in available three-dimensional structures
We extracted all structural motifs within loops from a non-redundant data set of 8186 protein chains, using the structural alphabet HMM-SA
Each encoded loop was decomposed into overlapping structural words, i.e. series of k consecutive structural letters, corresponding to fragments of k + 3 residues
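The decomposition into overlapping words can be sketched as follows. Because each structural letter covers four residues and consecutive letters overlap, a word of k letters spans k + 3 residues, so the seven-residue fragments of the abstract correspond to k = 4. This is a minimal sketch, not the authors' implementation: the function names, the toy letter strings and the lowered threshold in the example are assumptions, and coverage is measured here in letter positions rather than residues.

```python
from collections import Counter


def structural_words(encoded_loop, k=4):
    """Return the overlapping k-letter words of one encoded loop."""
    return [encoded_loop[i:i + k] for i in range(len(encoded_loop) - k + 1)]


def count_words(encoded_loops, k=4):
    """Count every structural word over a collection of encoded loops."""
    counts = Counter()
    for loop in encoded_loops:
        counts.update(structural_words(loop, k))
    return counts


def recurrent_words(counts, min_count=30):
    """Keep the words observed more than `min_count` times."""
    return {word for word, n in counts.items() if n > min_count}


def letter_coverage(encoded_loop, words, k=4):
    """Fraction of letter positions covered by at least one word in `words`."""
    if not encoded_loop:
        return 0.0
    covered = [False] * len(encoded_loop)
    for i in range(len(encoded_loop) - k + 1):
        if encoded_loop[i:i + k] in words:
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / len(encoded_loop)


# Toy encoded loops (hypothetical letter strings, not real HMM-SA output).
toy_loops = ["ABCDE", "ABCDF", "XABCD"]
toy_counts = count_words(toy_loops)
# The paper's threshold is 30; it is lowered here so the toy data has a hit.
toy_recurrent = recurrent_words(toy_counts, min_count=2)
toy_coverage = letter_coverage("XABCD", toy_recurrent)
```

Treating encoded loops as plain strings is what lets standard sequence tools (substring extraction, counting, set membership) replace structural alignment in the grouping step.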

Summary

Introduction

Protein loops encompass 50% of protein residues in available three-dimensional structures. They are often involved in protein function [1]: they participate in the active sites of enzymes [2] and in molecular recognition [3,4]. Protein loops were first seen as random because they are highly variable in sequence and structure and are subject to frequent insertions and deletions [9,10]. Because of this large variability, loops are the protein regions that are most difficult to analyze and model. In protein models, loops, and long loops in particular, are the source of many errors.


Journal: BMC Bioinformatics	Publication Date: Feb 4, 2010
Citations: 97	License type: CC BY 2.0

Similar Papers

Recurrent Structural Motifs in Non-Homologous Protein Structures
Maria Johansson ... Vincent Zoete
International Journal of Molecular Sciences | VOL. 14
10 Apr 2013

Recurrent structural RNA motifs, Isostericity Matrices and sequence alignments
A Lescoute
Nucleic Acids Research | VOL. 33
28 Apr 2005

Dissecting protein loops with a statistical scalpel suggests a functional implication of some structural motifs
Leslie Regad ... Juliette Martin
BMC Bioinformatics | VOL. 12
20 Jun 2011

SA-Mot: a web server for the identification of motifs of interest extracted from protein loops
Leslie Regad ... Colette Geneix
Nucleic Acids Research | VOL. 39
10 Jun 2011
