Abstract

Background: Structural properties of proteins such as secondary structure and solvent accessibility contribute to three-dimensional structure prediction, not only in the ab initio case but also when homology information to known structures is available. Structural properties are also routinely used in protein analysis even when homology is available, largely because homology modelling is lower throughput than, say, secondary structure prediction. Nonetheless, predictors of secondary structure and solvent accessibility are virtually always ab initio.

Results: Here we develop high-throughput machine learning systems for the prediction of protein secondary structure and solvent accessibility that exploit homology to proteins of known structure, where available, in the form of simple structural frequency profiles extracted from sets of PDB templates. We compare these systems with their state-of-the-art ab initio counterparts, and with a number of baselines in which secondary structures and solvent accessibilities are extracted directly from the templates. We show that structural information from templates greatly improves secondary structure and solvent accessibility prediction quality, and that, on average, the systems significantly enrich the information contained in the templates. For sequence similarity exceeding 30%, secondary structure prediction quality is approximately 90%, close to its theoretical maximum, and 2-class solvent accessibility prediction quality is roughly 85%. Gains are robust with respect to template selection noise, and significant for marginal sequence similarity and for short alignments, supporting the claim that these improved predictions may prove beneficial beyond the case in which clear homology is available.

Conclusion: The predictive systems are publicly available at http://distill.ucd.ie.
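To make the notion of a structural frequency profile concrete, the sketch below counts, at each query position, how often each secondary structure class occurs among the aligned templates. It is an illustration only, not the authors' implementation: the function name `structural_profile`, the three-letter alphabet H/E/C, the `-` gap symbol, and the unweighted counting are all assumptions made for this example, and the templates are assumed to be already aligned to the query.

```python
from collections import Counter

# Assumed three-class alphabet: helix, strand, coil (hypothetical choice).
SS_CLASSES = ("H", "E", "C")


def structural_profile(template_ss, classes=SS_CLASSES, gap="-"):
    """Return per-position class frequencies from aligned template strings.

    Each string in template_ss is assumed to be aligned to the query, so it
    has the query's length and uses `gap` where the template has no residue.
    """
    if not template_ss:
        return []
    length = len(template_ss[0])
    if any(len(t) != length for t in template_ss):
        raise ValueError("all template strings must match the query length")

    profile = []
    for pos in range(length):
        # Count the classes observed at this position, skipping gaps.
        counts = Counter(t[pos] for t in template_ss if t[pos] != gap)
        total = sum(counts[c] for c in classes)
        if total == 0:
            # No template covers this position: emit an all-zero row.
            profile.append([0.0] * len(classes))
        else:
            profile.append([counts[c] / total for c in classes])
    return profile


# Toy input: three templates aligned to a five-residue query.
templates = ["HHHC-", "HHEC-", "-HEC-"]
profile = structural_profile(templates)
```

In this toy input the third position is covered by one helix and two strand assignments, so its row is one third helix and two thirds strand, while the last position is covered by no template and gets an all-zero row. A predictor that receives such rows as extra input can fall back on sequence information wherever the profile is empty.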

Highlights

  • Structural properties of proteins such as secondary structure and solvent accessibility contribute to three-dimensional structure prediction, in the ab initio case and when homology information to known structures is available

  • The predictive systems are publicly available at http://distill.ucd.ie

  • All modern methods for the prediction of protein one-dimensional structural features are based on machine learning techniques [7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22], and exploit evolutionary information in the form of amino acid frequency profiles extracted from alignments of multiple sequences, generally of unknown structure


Summary

Introduction

Structural properties of proteins such as secondary structure and solvent accessibility contribute to three-dimensional structure prediction, in the ab initio case and when homology information to known structures is available. All modern methods for the prediction of protein one-dimensional structural features (i.e. those features which may be represented as a string of the same length as the primary sequence, such as secondary structure and solvent accessibility) are based on machine learning techniques [7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22], and exploit evolutionary information in the form of amino acid frequency profiles extracted from alignments of multiple sequences, generally of unknown structure. The progress of these methods over the last 10 years has been slow but steady, and is due to numerous factors: the ever-increasing size of training sets; more sensitive methods for the detection of homologues, such as PSI-BLAST [23]; the use of ensembles of multiple predictors trained independently, sometimes tens of them [12]; and more sophisticated machine learning techniques.

A hint of the historical, more than scientific, nature of this issue is the fact that when subtler algorithms for sequence similarity detection became available (e.g. PSI-BLAST [23]), the criteria for training vs. test set separation did not always change.
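The ensembles of independently trained predictors mentioned above can be illustrated with a minimal sketch that averages the per-residue class probabilities of several models and picks the most probable class. This is a generic averaging scheme assumed for illustration; the function name `ensemble_predict`, the H/E/C alphabet, and the toy probabilities are hypothetical and are not taken from the cited methods.

```python
def ensemble_predict(per_model_probs, classes=("H", "E", "C")):
    """Average class probabilities across models and return a class string.

    per_model_probs[m][i][k] is the probability that model m assigns to
    class k at residue i. All models are assumed to cover the same residues.
    """
    n_models = len(per_model_probs)
    length = len(per_model_probs[0])
    predicted = []
    for pos in range(length):
        # Unweighted mean of each class probability over the models.
        avg = [
            sum(model[pos][k] for model in per_model_probs) / n_models
            for k in range(len(classes))
        ]
        # Pick the class with the highest averaged probability.
        best = max(range(len(classes)), key=avg.__getitem__)
        predicted.append(classes[best])
    return "".join(predicted)


# Toy outputs of three hypothetical models for a two-residue sequence.
model_a = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1]]
model_b = [[0.6, 0.1, 0.3], [0.5, 0.3, 0.2]]
model_c = [[0.8, 0.1, 0.1], [0.2, 0.2, 0.6]]
prediction = ensemble_predict([model_a, model_b, model_c])
```

At the second residue the three toy models disagree (strand, helix, and coil respectively), and averaging settles on helix because it has the highest mean probability. Averaging before taking the maximum lets a confident model outweigh two hesitant ones, which a simple majority vote would not.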
