Abstract

Background

Since the function of a protein is largely dictated by its three-dimensional configuration, determining a protein's structure is of fundamental importance to biology. Here we report on a novel approach to determining the one-dimensional secondary structure of proteins (distinguishing α-helices, β-strands, and non-regular structures) from primary sequence data which makes use of Parallel Cascade Identification (PCI), a powerful technique from the field of nonlinear system identification.

Results

Using PSI-BLAST divergent evolutionary profiles as input data, dynamic nonlinear systems are built through a black-box approach to model the process of protein folding. Genetic algorithms (GAs) are applied to optimize the architectural parameters of the PCI models. The three-state prediction problem is broken down into a combination of three binary sub-problems, and protein structure classifiers are built using 2 layers of PCI classifiers. Careful construction of the optimization, training, and test datasets ensures that no homology exists between any training and testing data. A detailed comparison between PCI and 9 contemporary methods is provided over a set of 125 new protein chains guaranteed to be dissimilar to all training data. Unlike other secondary structure prediction methods, PCI-based prediction is provided through a web service with both human- and machine-readable interfaces. This server, called PCI-SS, is available at . In addition to a dynamic PHP-generated web interface for humans, a Simple Object Access Protocol (SOAP) interface permits remote invocation of the PCI-SS service. This machine-readable interface facilitates incorporation of PCI-SS into multi-faceted systems biology analysis pipelines requiring protein secondary structure information, and greatly simplifies high-throughput analyses. XML is used to represent the input protein sequence data and also to encode the resulting structure prediction in a machine-readable format. To our knowledge, this represents the only publicly available SOAP interface for a protein secondary structure prediction service with a published WSDL interface definition.

Conclusion

Relative to the 9 contemporary methods included in the comparison, cascaded PCI classifiers perform well; however, PCI finds its greatest application as a consensus classifier. When PCI is used to combine a sequence-to-structure PCI-based classifier with the current leading ANN-based method, PSIPRED, the overall accuracy (Q3) is maintained while the rate of occurrence of a particularly detrimental error is reduced by up to 25%. This improvement in BAD score, combined with the machine-readable SOAP web service interface, makes PCI-SS particularly useful for inclusion in a tertiary structure prediction pipeline.
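
The two evaluation measures named above can be illustrated with a small sketch. Q3 is the fraction of residues whose three-state label is predicted correctly. The abstract does not define the BAD score; the sketch assumes it counts residues where a helix is predicted as a strand or a strand as a helix, which may differ from the authors' exact definition. The function names and the example strings are hypothetical and are not taken from PCI-SS.

```python
def q3_score(observed: str, predicted: str) -> float:
    """Fraction of residues whose three-state label (H, E, C) is correct."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return correct / len(observed)


def bad_score(observed: str, predicted: str) -> float:
    """Fraction of residues with a helix/strand confusion.

    ASSUMPTION: 'BAD' is taken here to mean helix predicted as strand
    or strand predicted as helix; the source does not spell this out.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    confusions = {("H", "E"), ("E", "H")}
    confused = sum((o, p) in confusions for o, p in zip(observed, predicted))
    return confused / len(observed)


# Made-up 12-residue example: H = helix, E = strand, C = non-regular.
observed = "CHHHHCCEEEEC"
predicted = "CHHHECCEEHEC"

print(f"Q3  = {q3_score(observed, predicted):.3f}")
print(f"BAD = {bad_score(observed, predicted):.3f}")
```

Under this reading, a consensus classifier can keep Q3 unchanged while lowering BAD if it converts helix/strand confusions into less damaging errors, such as helix predicted as non-regular structure.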

Highlights

  • Since the function of a protein is largely dictated by its three dimensional configuration, determining a protein's structure is of fundamental importance to biology

  • Although a wide variety of methods have been applied to this problem, including those based on artificial neural networks (ANNs) [3,4,5,6,7,8], hidden Markov models (HMMs) [8,9], information theory [5], linear programming [10], and linear discriminant analysis (LDA) [5], no method has achieved the theoretical maximum predictive Q3 accuracy of 88% [2]

  • PSIPRED-local refers to the output of PSIPRED v2.45 run locally when provided with position-specific scoring matrix (PSSM) data generated from the filtered NCBI non-redundant (nr) database as frozen on 3 May 2004

Summary

Introduction

Since the function of a protein is largely dictated by its three-dimensional configuration, determining a protein's structure is of fundamental importance to biology. We report on a novel approach to determining the one-dimensional secondary structure of proteins (distinguishing α-helices, β-strands, and non-regular structures) from primary sequence data which makes use of Parallel Cascade Identification (PCI), a powerful technique from the field of nonlinear system identification. Computational prediction techniques provide an attractive alternative to experimental structure determination; however, the accurate prediction of 3D protein structure directly from amino acid sequence data continues to elude researchers when homologous protein structures are not available (comparative modeling), or for longer domains (de novo modeling). As an intermediate but useful step, attempts have been made to determine the one-dimensional secondary structure of proteins from primary sequence data [2]. Note that this study focuses on predicting the secondary structure of globular proteins; proteins with coiled-coil regions or transmembrane domains are excluded.
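
The abstract describes breaking the three-state problem (helix, strand, non-regular) into three binary sub-problems. A minimal sketch of how binary outputs could be recombined into a per-residue three-state label is given below. It assumes a one-vs-rest split with one score per state and picks the highest-scoring state. The source does not say which binary splits or combination rule PCI-SS uses, so the function name, the scores, and the argmax rule are all illustrative assumptions.

```python
from typing import Dict, List

# H = α-helix, E = β-strand, C = non-regular structure.
STATES = ("H", "E", "C")


def combine_binary_scores(scores: List[Dict[str, float]]) -> str:
    """Assign each residue the state whose binary classifier scored highest.

    ASSUMPTION: each dict holds one-vs-rest scores for a single residue,
    keyed by state. Ties resolve to the earliest state in STATES.
    """
    labels = []
    for residue_scores in scores:
        best_state = max(STATES, key=lambda state: residue_scores[state])
        labels.append(best_state)
    return "".join(labels)


# Hypothetical scores for a three-residue fragment.
example_scores = [
    {"H": 0.9, "E": 0.1, "C": 0.3},
    {"H": 0.2, "E": 0.7, "C": 0.4},
    {"H": 0.1, "E": 0.2, "C": 0.8},
]

print(combine_binary_scores(example_scores))
```

In a two-layer arrangement like the one the abstract mentions, labels or scores produced this way could serve as input to a second classifier that smooths predictions along the chain, though the details of that second layer are not given here.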

