Abstract

We study a simple abstract problem motivated by a variety of applications in protein sequence analysis. Consider a string of 0s and 1s of length L, and containing D 1s. If we believe that some or all of the 1s may be clustered near the start of the sequence, which subset is the most significantly so clustered, and how significant is this clustering? We approach this question using the minimum description length principle and illustrate its application by analyzing residues that distinguish translational initiation and elongation factor guanosine triphosphatases (GTPases) from other P-loop GTPases. Within a structure of yeast elongation factor 1α, these residues form a significant cluster centered on a region implicated in guanine nucleotide exchange. Various biomedical questions may be cast as the abstract problem considered here.

Summary

INTRODUCTION

In the study of proteins, a variety of methods derive, from analyzing a multiple alignment of related sequences or a corresponding phylogenetic tree, predictions that a specific set of residues within a particular protein are functionally important (Lichtarge et al, 1996; Mihalek et al, 2004; Fischer et al, 2008; Sankararaman and Sjolander, 2008; Janda et al, 2012; Neuwald and Altschul, 2016). We focus on clusters of residues that are "distinguished" by a computer analysis of multiple alignments, rather than on residues having specific physical–chemical properties such as charge (Karlin and Zhu, 1996; Zhu and Karlin, 1996). This has no formal effect on our approach, only a motivational one. We use the Minimum Description Length (MDL) principle (Grünwald, 2007) to formulate and analyze our problem in simple abstract terms, and illustrate its application to a specific protein sequence and structure.
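To make the abstract problem concrete, the following is a minimal sketch of an MDL-style comparison for a binary string of length L containing D 1s. It is not the formulation developed in the paper: the two-part code, the uniform costs charged for the cut position and the prefix count, and the function names (`null_bits`, `prefix_bits`, `best_prefix`) are all assumptions chosen for illustration. The null model pays only to name which D of the L positions hold 1s. The alternative model pays to state a cut X and the number d of 1s before it, then names the positions within each segment separately.

```python
from math import comb, log2


def null_bits(L, D):
    """Bits to name which D of L positions hold 1s (no clustering assumed)."""
    return log2(comb(L, D))


def prefix_bits(L, D, X, d):
    """Bits for a two-part code with a cut after position X and d 1s before it.

    Assumption for illustration: X and d are each charged a uniform cost.
    """
    model = log2(L) + log2(D + 1)  # X in 1..L, d in 0..D
    data = log2(comb(X, d)) + log2(comb(L - X, D - d))
    return model + data


def best_prefix(s):
    """Return (bits_saved, X, d) for the best cut, or (0.0, None, None)
    if no cut shortens the description relative to the null model."""
    L, D = len(s), s.count("1")
    base = null_bits(L, D)
    best = (0.0, None, None)
    d = 0
    for X in range(1, L):
        d += s[X - 1] == "1"  # running count of 1s in the first X positions
        saving = base - prefix_bits(L, D, X, d)
        if saving > best[0]:
            best = (saving, X, d)
    return best


if __name__ == "__main__":
    print(best_prefix("1" * 5 + "0" * 45))  # all 1s packed at the start
    print(best_prefix("10" * 25))           # 1s spread evenly
```

Under these assumed costs, a positive saving means the clustered description is shorter than the null description, and a larger saving indicates stronger evidence of clustering near the start. A string whose 1s are spread evenly yields no cut that pays for its own model cost.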

FORMALIZATION USING THE MDL PRINCIPLE
The description length of S given H1
The complexity of H1
Flattening Jeffreys’ priors
RANDOM SIMULATION
APPLICATION
Findings
CONCLUSION