Computational protein design: validation and possible relevance as a tool for homology searching and fold recognition.

Marcel Schmidt Am Busch,Audrey Sedano,Thomas Simonson

doi:10.1371/journal.pone.0010410


Open Access

https://doi.org/10.1371/journal.pone.0010410


Abstract

Background

Protein fold recognition usually relies on a statistical model of each fold; each model is constructed from an ensemble of natural sequences belonging to that fold. A complementary strategy may be to employ sequence ensembles produced by computational protein design. Designed sequences can be more diverse than natural sequences, possibly avoiding some limitations of experimental databases.

Methodology/Principal Findings

We explore this strategy for four SCOP families: Small Kunitz-type inhibitors (SKIs), Interleukin-8 chemokines, PDZ domains, and large Caspase catalytic subunits, represented by 43 structures. An automated procedure is used to redesign the 43 proteins. We use the experimental backbones as fixed templates in the folded state and a molecular mechanics model to compute the interaction energies between sidechain and backbone groups. Calculations are done with the Proteins@Home volunteer computing platform. A heuristic algorithm is used to scan the sequence and conformational space, yielding 200,000–300,000 sequences per backbone template. The results confirm and generalize our earlier study of SH2 and SH3 domains. The designed sequences resemble moderately distant natural homologues of the initial templates; e.g., the SUPERFAMILY profile Hidden Markov Model library recognizes 85% of the low-energy sequences as native-like. Conversely, Position Specific Scoring Matrices derived from the sequences can be used to detect natural homologues within the SwissProt database: 60% of known PDZ domains are detected and around 90% of known SKIs and chemokines. Energy components and inter-residue correlations are analyzed and ways to improve the method are discussed.

Conclusions/Significance

For some families, designed sequences can be a useful complement to experimental ones for homologue searching. However, improved tools are needed to extract more information from the designed profiles before the method can be of general use.

Highlights

Protein sequence databases continue to grow rapidly, with ~6 million entries in UniProt [1,2,3,4,5,6,7,8]
Similarity scores are a more reliable measure of the native-like character of designed sequences, because they take into account the diversity of the natural sequences [42,67]
Only 12% of the designed scores overlap with the scores of the small Pfam set; 73% overlap with the scores of a larger Pfam set

Summary

Introduction

Protein sequence databases continue to grow rapidly, with ~6 million entries in UniProt [1,2,3,4,5,6,7,8]. Most protein structures can be subdivided into one or more compact domains, each with its own independent fold. Known domain structures can be classified into a few thousand families, collected in public databases such as Pfam and SCOP [11,12,13,14]. To characterize the 3D structure of a new protein sequence, the first step is to identify one or more homologous proteins of known structure; from these, one can infer, or "recognize", the new protein's domains and their respective folds. Protein fold recognition usually relies on a statistical model of each fold; each model is constructed from an ensemble of natural sequences belonging to that fold. A complementary strategy may be to employ sequence ensembles produced by computational protein design. Designed sequences can be more diverse than natural sequences, possibly avoiding some limitations of experimental databases.
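As an illustration of what a "statistical model of a fold" built from a sequence ensemble can look like, the following minimal sketch derives a log-odds position-specific scoring matrix from a set of equal-length sequences and uses it to score a candidate. It is not the procedure used in the paper: the function names, the toy sequences, the uniform background frequencies, and the pseudocount of 1 are all assumptions chosen for brevity, and the input is assumed to be ungapped and restricted to the 20 standard amino acids.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Build a log-odds scoring matrix (one dict per column).

    Assumptions (hypothetical, for illustration only): ungapped sequences
    of equal length, uniform background frequencies, additive pseudocounts.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs)
        # Only standard residues contribute to the column total.
        observed = sum(counts[aa] for aa in AMINO_ACIDS)
        total = observed + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score(pssm, seq):
    """Sum the per-column log-odds scores for a sequence of matching length."""
    if len(seq) != len(pssm):
        raise ValueError("sequence length must match the matrix length")
    return sum(column[aa] for column, aa in zip(pssm, seq))


# Toy ensemble standing in for designed sequences (made-up data).
designed = ["ACDE", "ACDF", "ACEE", "GCDE"]
pssm = build_pssm(designed)
native_like = score(pssm, "ACDE")
unrelated = score(pssm, "WWWW")
```

In this toy case a sequence resembling the ensemble scores higher than an unrelated one. A realistic profile would use empirical background frequencies, sequence weighting, and gap handling, as established profile-building tools do.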

Methods

Results

Conclusion