Abstract

Many researchers have studied the relationship between the biological functions of proteins and the structures of both their overall backbones of amino acids and their binding sites. Much of this work has focused on summarizing structural features of binding sites as scalar quantities, which can lose a great deal of information since the structures are three-dimensional. Additionally, a common way of comparing binding sites is to align their atoms, a computationally intensive procedure that substantially limits the types of analysis and modeling that can be done. In this work, we develop a novel encoding of binding sites as covariance matrices of the distances of atoms to the principal axes of the structures. This representation is invariant to the coordinate system chosen for the atoms in the binding sites, which removes the need to align the sites to a common coordinate system; it is also computationally efficient and permits the development of probability models. These models can then be used both to better understand groups of binding sites that bind to the same ligand and to perform classification for these ligand groups. We demonstrate the utility of our method for discriminating among binding ligands through classification studies with two benchmark datasets using nearest mean and polytomous logistic regression classifiers.
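As a rough illustration of the encoding described above, the following Python sketch computes a 3×3 covariance matrix of atom-to-principal-axis distances for one binding site. It is only a sketch under stated assumptions: the function name `cdpa_sketch` is hypothetical, the principal axes are assumed to be the PCA directions of the centered atom coordinates, and the full text's exact definition (for example any weighting or scaling) may differ.

```python
import numpy as np

def cdpa_sketch(coords):
    """Hypothetical sketch: covariance of distances to principal axes.

    coords: (n_atoms, 3) array of atom coordinates for one binding site.
    Returns a 3x3 covariance matrix.
    """
    X = np.asarray(coords, dtype=float)
    # Center the atoms so the principal axes pass through the centroid.
    Xc = X - X.mean(axis=0)
    # Rows of vt are the principal axes (assumed here to be PCA directions).
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    # Projection of every atom onto every axis, shape (n_atoms, 3).
    proj = Xc @ vt.T
    # Distance from a point to a line through the origin:
    # sqrt(|x|^2 - (x . v)^2) for a unit direction v.
    sq_norm = (Xc ** 2).sum(axis=1, keepdims=True)
    dist = np.sqrt(np.clip(sq_norm - proj ** 2, 0.0, None))
    # Covariance across atoms of the three distance columns.
    return np.cov(dist, rowvar=False)
```

Because the distances depend only on the centered atoms and their own principal axes, rotating or translating the input coordinates should leave the result unchanged (provided the principal axes are well separated), which is the invariance property the abstract describes.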

Highlights

  • Proteins are molecules consisting of chains of amino acids that fold into a 3-dimensional structure and perform biological functions by binding to various chemicals

  • We present the results of performing classification studies for both datasets using Covariance of Distances to Principal Axes (CDPA) and discuss the particular challenges involved in working with each dataset

  • While it is improper to compare directly to the results of [14] for the extended Kahraman dataset since some of the set’s proteins are no longer listed in the Protein Data Bank (PDB) and we had to perform some light data cleaning, it is clear that CDPA with the logistic regression classifier still performed comparably to [14] for the dataset


Summary

Introduction

Proteins are molecules consisting of chains of amino acids that fold into a 3-dimensional structure and perform biological functions by binding to various chemicals. One of the datasets, from [18], is known in the literature as the Kahraman dataset. It consists of 100 protein binding sites, each of which binds to one of 10 ligands (AMP, ATP, FAD, FMN, GLC, HEM, NAD, PO4, EST, AND). When we obtained the 3D structure information for the extended Kahraman dataset from the PDB, however, 7 binding sites had been removed from the database, so they are not considered in this analysis. While this prevents us from fully comparing our new methodology with other methods, we can still use this data to demonstrate the utility of our methods while presenting results for the other methods for reference.

Methodology
Results
Discussion and conclusions
Limitations and future work