Investigation of sequence features of hinge-bending regions in proteins with domain movements using kernel logistic regression

Ruth Veevers, Gavin Cawley, Steven Hayward

doi:10.1186/s12859-020-3464-3

Open Access

Journal: BMC Bioinformatics
Publication Date: Apr 9, 2020
Citations: 3
License type: open-access

Affiliation: University of East Anglia

Abstract

Background: Hinge-bending movements in proteins comprising two or more domains form a large class of functional movements. Hinge-bending regions demarcate protein domains and collectively control the domain movement. Consequently, the ability to recognise sequence features of hinge-bending regions and to predict them from sequence alone would benefit various areas of protein research. For example, an understanding of how the sequence features of these regions relate to dynamic properties in multi-domain proteins would aid in the rational design of linkers in therapeutic fusion proteins.

Results: The DynDom database of protein domain movements comprises sequences annotated to indicate whether each amino acid residue is located within a hinge-bending region or within an intradomain region. Using statistical methods and Kernel Logistic Regression (KLR) models, these data were used to determine sequence features that favour or disfavour hinge-bending regions. This is a difficult classification problem because the number of negative cases (intradomain residues) is much larger than the number of positive cases (hinge residues). The statistical methods and the KLR models both show that cysteine has the lowest propensity for hinge-bending regions and proline has the highest, even though proline is the most rigid amino acid. As hinge-bending regions have previously been shown to occur frequently at the terminal regions of secondary structures, the propensity for proline at these regions is likely due to its tendency to break secondary structures. The KLR models also indicate that isoleucine may act as a domain-capping residue. We have found that a quadratic KLR model outperforms a linear KLR model and that performance continues to improve up to very long window lengths (eighty residues), indicating long-range correlations.

Conclusion: In contrast to the only other approach that focused solely on interdomain hinge-bending regions, the method provides a modest and statistically significant improvement over a random classifier. An explanation of the KLR results is that, in the prediction of hinge-bending regions, a long-range correlation is at play between a small number of amino acids that either favour or disfavour hinge-bending regions. The resulting sequence-based prediction tool, HingeSeek, is available to run through a webserver at hingeseek.cmp.uea.ac.uk.
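The difference between the linear and quadratic windowed models mentioned in the Results can be illustrated with a minimal sketch. The abstract does not specify the residue encoding or the exact kernel, so this assumes a one-hot encoding of each window position and a polynomial kernel of the form (x·y + 1)^2; the function names, the padding character, and the toy sequence are all hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def windows(sequence, half_width, pad="-"):
    """Return one window of length 2*half_width + 1 centred on each residue.

    Positions that fall outside the sequence are filled with `pad`
    (an assumed convention, not taken from the paper).
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]


def one_hot_dot(w1, w2):
    """Dot product of two one-hot encoded windows.

    With one-hot encoding this equals the number of positions at which
    both windows hold the same amino acid; padding contributes nothing.
    """
    return sum(1 for a, b in zip(w1, w2) if a == b and a in AMINO_ACIDS)


def linear_kernel(w1, w2):
    """Linear kernel: each window position contributes independently."""
    return float(one_hot_dot(w1, w2))


def quadratic_kernel(w1, w2):
    """Assumed quadratic kernel (x.y + 1)^2: also rewards pairs of matching positions."""
    return float((one_hot_dot(w1, w2) + 1) ** 2)


# Hypothetical toy sequence, for illustration only.
toy_windows = windows("MKPLCI", half_width=1)
```

Because the squared term expands into products of pairs of positions, a quadratic kernel of this kind can capture interactions between residues at different offsets within the window, which a linear kernel cannot. This is one plausible reading of why a quadratic model might benefit from long windows, not a statement of the authors' implementation.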

Highlights

Hinge-bending movements in proteins comprising two or more domains form a large class of functional movements
We believe the reason for Pro being located in these regions is that it often acts as a terminator for secondary structure elements and appears at hinge regions because they are often located at the terminal regions of secondary structures [10]
This is thought to be due to its secondary-structure breaking tendency as it is at the termini of secondary structures that hinge bending often occurs
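As a rough illustration of what a residue propensity for hinge-bending regions could look like, the sketch below uses a common definition: the frequency of an amino acid among hinge residues divided by its frequency among all residues. The statistic actually used in the paper is not given here, so this definition, the `H`/`D` label alphabet, the function name, and the toy data are all assumptions.

```python
from collections import Counter


def hinge_propensity(sequences, labels):
    """Estimate a hinge-region propensity for each amino acid.

    `labels` holds one string per sequence, with 'H' marking a hinge
    residue and any other character marking an intradomain residue
    (an assumed annotation format). A value above 1 means the amino
    acid is over-represented in hinge regions under this definition.
    """
    hinge_counts = Counter()
    total_counts = Counter()
    for seq, lab in zip(sequences, labels):
        if len(seq) != len(lab):
            raise ValueError("sequence and label string lengths differ")
        for aa, flag in zip(seq, lab):
            total_counts[aa] += 1
            if flag == "H":
                hinge_counts[aa] += 1
    n_hinge = sum(hinge_counts.values())
    n_total = sum(total_counts.values())
    if n_hinge == 0:
        raise ValueError("no hinge residues in the input")
    return {
        aa: (hinge_counts[aa] / n_hinge) / (total_counts[aa] / n_total)
        for aa in total_counts
    }


# Hypothetical toy data: proline is the only hinge residue.
toy_propensity = hinge_propensity(["APCA", "PCAA"], ["DHDD", "HDDD"])
```

In the toy data every hinge residue is proline, so proline scores above 1 while alanine and cysteine score 0. Real annotations would be far more imbalanced, with many more intradomain residues than hinge residues.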

Summary

Introduction

Hinge-bending movements in proteins comprising two or more domains form a large class of functional movements. From a structural perspective, a domain is characterised as a globular, spatially separate part of a protein, and methods have been developed to recognise domains from this property [2]. Domains are considered to be able to fold independently of other parts of the protein and are associated with a distinct function. For protein structure databases such as SCOP [3], SCOP2 [4] and CATH [5], they form the basic element of classification. They can be identified from sequence homology using methods such as Pfam [6], where multiple-sequence alignments of family members of a domain are encoded as hidden Markov models.

Methods

Results

Discussion

Conclusion