Abstract

The prediction of domain/linker residues in protein sequences is a crucial task in the functional classification of proteins, homology-based protein structure prediction, and high-throughput structural genomics. In this work, a novel consensus-based machine-learning technique was applied for residue-level prediction of the domain/linker annotations in protein sequences using ordered/disordered regions along protein chains and a set of physicochemical properties. Six different classifiers (decision tree, Gaussian naïve Bayes, linear discriminant analysis, support vector machine, random forest, and multilayer perceptron) were exhaustively explored for the residue-level prediction of domain/linker regions. The protein sequences from the curated CATH database were used for training and cross-validation experiments. Test results obtained by applying the developed PDP-CON tool to the mutually exclusive, independent proteins of the CASP-8, CASP-9, and CASP-10 databases are reported. An n-star quality consensus approach was used to combine the results yielded by different classifiers. The average PDP-CON accuracy and F-measure values for the CASP targets were found to be 0.86 and 0.91, respectively. The dataset, source code, and all supplementary materials for this work are available at https://cmaterju.org/cmaterbioinfo/ for noncommercial use.

Electronic supplementary material: The online version of this article (doi:10.1007/s00894-016-2933-0) contains supplementary material, which is available to authorized users.
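The n-star consensus step and the two reported metrics can be illustrated with a small sketch. It is not the PDP-CON implementation. It assumes that "n-star" means a residue is labeled as linker when at least n of the classifiers vote for linker, and that the F-measure is the usual harmonic mean of precision and recall; this excerpt does not define either. The function names, classifier keys, and toy labels are hypothetical.

```python
from typing import Dict, List, Tuple


def n_star_consensus(predictions: Dict[str, List[int]], n: int) -> List[int]:
    """Combine per-residue binary predictions (1 = linker, 0 = domain).

    Assumed rule: a residue is called linker when at least `n`
    classifiers predict linker for it.
    """
    if not predictions:
        raise ValueError("at least one classifier is required")
    if len({len(p) for p in predictions.values()}) != 1:
        raise ValueError("all classifiers must label the same residues")
    if not 1 <= n <= len(predictions):
        raise ValueError("n must be between 1 and the number of classifiers")
    # Count linker votes per residue position.
    votes = [sum(column) for column in zip(*predictions.values())]
    return [1 if v >= n else 0 for v in votes]


def accuracy_and_f_measure(
    truth: List[int], predicted: List[int], positive: int = 1
) -> Tuple[float, float]:
    """Return (accuracy, F-measure) for binary residue labels."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    denom = 2 * tp + fp + fn
    f_measure = (2 * tp / denom) if denom else 0.0
    return accuracy, f_measure


# Toy, made-up predictions from three hypothetical classifiers.
toy = {
    "dt": [0, 1, 1, 0, 0, 1],
    "svm": [0, 1, 0, 0, 1, 1],
    "rf": [0, 1, 1, 0, 0, 0],
}
two_star = n_star_consensus(toy, n=2)
three_star = n_star_consensus(toy, n=3)
```

Under this assumed rule, raising n trades recall for precision on the linker class: a 3-star call requires unanimity among the three toy classifiers, while a 1-star call accepts any single vote.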

Highlights

  • Some simple combinations of protein secondary-structural elements that are found to occur frequently in proteins are referred to as super-secondary structures or motifs

  • We examined six different machine-learning algorithms using a carefully chosen feature set consisting of a hydrophobicity index, a linker index, polarity values, ordered/disordered regions in the protein sequence, and flexibility parameters for residue-level protein domain boundary prediction from sequence information

  • We considered six different types of classifiers: decision tree (DT), Gaussian naïve Bayes (GNB), linear discriminant analysis (LDA), support vector machine (SVM), random forest (RF), and multilayer perceptron (MLP)
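As an illustration of how one of the listed features could be turned into a residue-level input, the sketch below averages a hydrophobicity scale over a sliding window centered on each residue. The excerpt does not say which hydrophobicity index or window size the authors used. The Kyte-Doolittle scale and the window length here are stand-in assumptions, and the function name is hypothetical.

```python
from typing import List

# Kyte-Doolittle hydropathy values (a stand-in; the paper's index may differ).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def windowed_hydrophobicity(sequence: str, window: int = 5) -> List[float]:
    """Mean hydropathy in a window centered on each residue.

    Windows are truncated at the chain ends, so every residue
    gets a value. `window` must be a positive odd number.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    # Raises KeyError for non-standard residue codes.
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    half = window // 2
    features = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        features.append(sum(values[lo:hi]) / (hi - lo))
    return features


# Example on a short, made-up peptide.
profile = windowed_hydrophobicity("AIG", window=3)
```

The other features named above (linker index, polarity, flexibility, order/disorder) could be encoded the same way and concatenated per residue before being passed to a classifier, although the exact encoding used by the authors is not given here.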


Introduction

Some simple combinations of protein secondary-structural elements that occur frequently in proteins are referred to as super-secondary structures or motifs. Several motifs pack together to form compact, local, semi-independent units called domains. A domain is a segment of a polypeptide chain that can fold into a three-dimensional structure irrespective of the presence of other segments of the chain [1]. The overall 3D structure of a protein’s polypeptide chain is referred to as its tertiary structure, and the domain is the fundamental building block of that structure. Each domain contains a hydrophobic core built from secondary-structural units connected by loop regions. Two-thirds of the proteins in unicellular organisms and more than 80% of those in metazoans are multidomain proteins created as a result of gene duplication events. As the complexity of an organism increases, the number of domains in its proteins increases.
