Abstract

Background: Large datasets of protein interactions provide a rich resource for the discovery of Short Linear Motifs (SLiMs) that recur in unrelated proteins. However, existing methods for estimating the probability of motif recurrence may be biased by the size and composition of the search dataset, such that p-value estimates from different datasets, or from motifs containing different numbers of non-wildcard positions, are not strictly comparable. Here, we develop more exact methods and explore the potential biases of computationally efficient approximations.

Results: A widely used heuristic for the calculation of motif over-representation approximates motif probability by assuming that all proteins have the same length and composition. We introduce pv, which calculates the probability exactly. Second, the recently introduced SLiMFinder statistic Sig accounts for multiple testing (across all possible motifs) in motif discovery. However, it approximates the probability of all other possible motifs occurring with a score of p or less as being equal to p. Here, we show that the exhaustive calculation of the probability of all possible motif occurrences that are as rare or rarer than the motif of interest, Sig', may be carried out efficiently by grouping motifs of a common probability (i.e. those which have permuted orders of the same residues). Sig'v, which corrects both approximations, is shown to be uniformly distributed in a random dataset when searching for non-ambiguous motifs, indicating that it is a robust significance measure.

Conclusions: A method is presented to compute exactly the true probability of a non-ambiguous short protein sequence motif, and the utility of an approximate approach for novel motif discovery across a large number of datasets is demonstrated.
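The grouping idea behind Sig' can be illustrated with a minimal sketch. It is not the authors' implementation and does not compute Sig' itself; it only shows why motifs that are permutations of the same residues can be handled as one class. The sketch assumes that the per-site probability of a fixed (non-ambiguous) motif is the product of independent background residue frequencies; the function names and the toy three-letter alphabet are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import factorial, isclose, prod


def composition_classes(freqs, k):
    """Yield (residues, n_orderings, per_site_probability) for each multiset of k residues.

    Assumption: a motif's per-site probability is the product of its residue
    frequencies, so every ordering of the same residues has the same probability.
    """
    for combo in combinations_with_replacement(sorted(freqs), k):
        # Number of distinct orderings of this multiset: k! / prod(count_i!)
        n_orderings = factorial(k)
        for count in Counter(combo).values():
            n_orderings //= factorial(count)
        yield combo, n_orderings, prod(freqs[r] for r in combo)


def mass_as_rare_or_rarer(freqs, motif):
    """Sum the per-site probability of all same-length motifs as rare or rarer than `motif`."""
    # Sort the residues so the product is evaluated in the same order as in the classes.
    p_motif = prod(freqs[r] for r in sorted(motif))
    return sum(
        n * p
        for _, n, p in composition_classes(freqs, len(motif))
        if p <= p_motif or isclose(p, p_motif)
    )


# Toy background frequencies (hypothetical, not taken from any dataset).
toy_freqs = {"A": 0.5, "C": 0.3, "W": 0.2}
for residues, n, p in composition_classes(toy_freqs, 2):
    print("".join(residues), n, p)
print(mass_as_rare_or_rarer(toy_freqs, "CW"))
```

Under these assumptions, the saving comes from iterating over compositions rather than ordered motifs: for 20 amino acids and five fixed positions there are 42,504 composition classes but 3,200,000 ordered motifs.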

Highlights

  • Large datasets of protein interactions provide a rich resource for the discovery of Short Linear Motifs (SLiMs) that recur in unrelated proteins

  • SLiM rediscovery, which was pioneered by PROSITE [10], uses regular expression or profile matching to search for novel instances of previously known SLiMs

  • We introduced SLiMFinder [21], a probabilistic method for SLiM discovery that heuristically accounts for these shortcomings with a two-step scoring scheme

Summary

Introduction

Large datasets of protein interactions provide a rich resource for the discovery of Short Linear Motifs (SLiMs) that recur in unrelated proteins. SLiMs are short (typically between three and ten amino acids in length) and degenerate (positions are often flexible in terms of possible amino acids), making motif context important for specificity due to the limited number of residues in the interaction interface [3]. This simplicity gives them an evolutionary plasticity that is [...]. Increased knowledge of SLiM attributes, through the study of known functional motifs, has enabled advancements in computational methods for SLiM discovery. These methods have been used to discover novel instances of both KEN box and EH1 transcriptional repressor motifs [12,13].

