Abstract

Background: The recent explosion in the availability of complete genome sequences has led to the cataloging of tens of thousands of new proteins and putative proteins. Many of these proteins can be structurally or functionally categorized from sequence conservation alone. In contrast, little attention has been given to the meaning of poorly-conserved sites in families of proteins, which are typically assumed to be of little structural or functional importance.

Results: Recently, using statistical free energy analysis of tetratricopeptide repeat (TPR) domains, we observed that positions in contact with peptide ligands are more variable than surface positions in general. Here we show that statistical analysis of TPRs, ankyrin repeats, Cys2His2 zinc fingers and PDZ domains accurately identifies specificity-determining positions by their sequence variation. Sequence variation is measured as deviation from a neutral reference state, and we present probabilistic and information theory formalisms that improve upon recently suggested methods such as statistical free energies and sequence entropies.

Conclusion: Sequence variation has been used to identify functionally-important residues in four selected protein families. With TPRs and ankyrin repeats, protein families that bind highly diverse ligands, the effect is so pronounced that sequence "hypervariation" alone can be used to predict ligand binding sites.

Highlights

  • The recent explosion in the availability of complete genome sequences has led to the cataloging of tens of thousands of new proteins and putative proteins

  • Even when the ligands have significant features in common, sequence variation can be used to "dissect" binding sites to identify specificity-determining residues. We demonstrate this sequence variation using probabilistic and information theory approaches closely tied to the mathematics of covariation, which are more suitable for this type of analysis than statistical free energy (SFE) or Shannon entropies

  • We calculated the separation of the amino acid distribution at each position in tetratricopeptide repeat (TPR) domains from a position-independent reference state, namely amino acid usage across all proteins in yeast (Figure 1a)
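
The per-position comparison against a reference state described in the highlights can be illustrated with a small sketch. It assumes relative entropy (Kullback-Leibler divergence) as the measure of separation, which is one common information-theoretic choice and not necessarily the exact formalism used in the paper. The function names, the pseudocount, the toy alignment, and the uniform background are all hypothetical; a real analysis would use an alignment of TPR sequences and measured yeast amino acid frequencies.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(column, pseudocount=1.0):
    """Estimate amino acid frequencies for one alignment column.

    The pseudocount (an assumption of this sketch) keeps every
    frequency strictly positive so that logarithms stay finite.
    Gap characters and other non-standard symbols are ignored.
    """
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS)


def per_position_divergence(alignment, background, pseudocount=1.0):
    """Score each column by its divergence from the reference state."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for i in range(length):
        column = [seq[i] for seq in alignment]
        p = column_frequencies(column, pseudocount)
        scores.append(relative_entropy(p, background))
    return scores


# Hypothetical toy data: two invariant columns and one variable column.
toy_alignment = ["ACD", "ACE", "ACF", "ACG"]
# Placeholder reference state: uniform usage instead of yeast frequencies.
uniform_background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

scores = per_position_divergence(toy_alignment, uniform_background)
for position, score in enumerate(scores, start=1):
    print(f"position {position}: {score:.3f} bits")
```

Under this measure, a conserved column sits far from the reference state and scores high, while a column whose composition resembles the background scores close to zero. Positions of the second kind are the ones the text calls variable.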


Introduction

The recent explosion in the availability of complete genome sequences has led to the cataloging of tens of thousands of new proteins and putative proteins. Sequence conservation alone can be used to structurally categorize many proteins or putative proteins [3]. Sites with poor sequence conservation have been largely ignored, because they are assumed to be of little structural or functional importance [9]. Solvent-exposed residues ...
