Abstract

Proteins are characterized by their structures and functions, and these two fundamental aspects of proteins are assumed to be related. To model such a relationship, a single representation to model both protein structure and function would be convenient, yet so far, the most effective models for protein structure or function classification do not rely on the same protein representation. Here we provide a computationally efficient implementation for large datasets to calculate residue cluster classes (RCCs) from protein three-dimensional structures and show that such representations enable a random forest algorithm to effectively learn the structural and functional classifications of proteins, according to the CATH and Gene Ontology criteria, respectively. RCCs are derived from residue contact maps built from different distance criteria, and we show that 7 or 8 Å with or without amino acid side-chain atoms rendered the best classification models. The potential use of a unified representation of proteins is discussed and possible future areas for improvement and exploration are presented.

Highlights

  • Proteins are molecules found in living organisms and participate in many diverse cellular and molecular functions

  • We believe that a single protein representation with which to efficiently predict both 3D protein structure and function would provide a mathematical framework to explore the relationship between these two fundamental aspects of proteins

  • The representation is based on counting the 26 different maximal clique classes that are derived from the 3D structure and protein sequence given a contact distance threshold of 5 Å, including atoms of the side chains; we referred to these maximal cliques as residue cluster classes or RCCs

Summary

Introduction

Proteins are molecules found in living organisms and participate in many diverse cellular and molecular functions. Machine-learning (ML) based models are currently the best models for predicting 3D protein structures [7] and protein functions [8]. The representation used here is based on counting the 26 different maximal clique classes that are derived from the 3D structure and protein sequence given a contact distance threshold of 5 Å, including atoms of the side chains; we refer to these maximal cliques as residue cluster classes or RCCs (see Figure 1 and Materials and Methods). We developed a computationally efficient implementation for computing RCCs on large datasets of 3D protein structures and distribute this implementation freely (see Supplementary Materials); this implementation incorporates some variations in the contact definition (different distances and the exclusion of the side-chain atoms). This allowed us to further explore protein structure classification and protein function using ML methods.
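
The pipeline described above (contact map, then maximal cliques, then counts per class) can be illustrated with a minimal Python sketch. This is not the authors' distributed implementation. The function names (`contact_map`, `maximal_cliques`, `segment_pattern`, `rcc_counts`) and the input layout (one list of atom coordinates per residue, in sequence order) are hypothetical. Grouping cliques of 3 to 6 residues by the lengths of their sequence-contiguous runs is our assumption about how classes could be assigned; the text only states that there are 26 classes derived from structure and sequence.

```python
from collections import Counter
from itertools import combinations
from math import dist


def contact_map(residues, threshold=5.0):
    """Return adjacency sets: residues i and j are in contact when any pair
    of their atoms lies within `threshold` angstroms (hypothetical layout:
    residues[i] is a list of (x, y, z) atom coordinates)."""
    adj = {i: set() for i in range(len(residues))}
    for i, j in combinations(range(len(residues)), 2):
        if any(dist(a, b) <= threshold for a in residues[i] for b in residues[j]):
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Bron-Kerbosch with pivoting; yields each maximal clique as a sorted tuple."""
    def expand(r, p, x):
        if not p and not x:
            yield tuple(sorted(r))
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())


def segment_pattern(clique):
    """Lengths of runs of sequence-consecutive residues, largest first.
    Assumed class label, e.g. (0, 1, 2, 4) -> (3, 1)."""
    runs, run = [], 1
    for a, b in zip(clique, clique[1:]):
        if b == a + 1:
            run += 1
        else:
            runs.append(run)
            run = 1
    runs.append(run)
    return tuple(sorted(runs, reverse=True))


def rcc_counts(residues, threshold=5.0, sizes=range(3, 7)):
    """Count maximal cliques of the given sizes per segment pattern."""
    adj = contact_map(residues, threshold)
    return Counter(
        segment_pattern(c) for c in maximal_cliques(adj) if len(c) in sizes
    )


# Toy input: five single-atom "residues"; residue 3 is far from the others.
toy = [[(0, 0, 0)], [(3, 0, 0)], [(0, 3, 0)], [(20, 0, 0)], [(0, 0, 3)]]
counts = rcc_counts(toy, threshold=5.0)
```

Under this assumed labelling, the number of possible patterns for cliques of 3, 4, 5, and 6 residues is 3 + 5 + 7 + 11 = 26, which matches the class count quoted in the text. The pairwise distance loop is quadratic in the number of residues and is meant only to show the idea; an efficient implementation for large datasets would need a spatial index or similar.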

RCC Calculation Implementation
Contact Map Calculation
Maximal Cliques Calculation
RCC Calculation
RCC Database
Model Training and Testing
Protein Structural Classification
Protein Functional Classification
Baseline
Discussion