Abstract

Some natural proteins display recurrent structural patterns. Despite being highly similar at the tertiary structure level, the repeating units within a single repeat protein can be extremely variable at the sequence level. We use a mathematical definition of a repetition and investigate its occurrences in sequences of different protein families. We found that long stretches of perfect repetitions are infrequent in individual natural proteins, even in those known to fold into structures of recurrent structural motifs. In contrast, we found that natural repeat proteins are indeed repetitive within their families, exhibiting abundant stretches of 6 amino acids or longer that are perfect repetitions in the reference family. We provide a systematic quantification of this repetitiveness. We show that this form of repetitiveness is not exclusive to repeat proteins, but also occurs in globular domains. A by-product of this work is a fast quantification of the likelihood that a protein belongs to a family.

Highlights

  • Natural repeat proteins are coded with tandem copies of similar amino acid stretches

  • Long stretches of perfect repeats are infrequent in individual natural proteins, even in those that fold into structures of recurrent structural motifs

Summary

Introduction

Natural repeat proteins are coded with tandem copies of similar amino acid stretches. These molecules are broadly classified according to the length of the minimal repeating unit [1]. The solutions for finding inexact repeats in sequences [7, 8] include alphabet replacements using scoring matrices, sophisticated notions of sequence similarity based on an allowed percentage of mismatches, and elaborate mathematical representations such as Hidden Markov Models. To a very large extent these solutions have been satisfactory. However, these methods rely on the fine-tuning of different parameters in order to account for the inexactness of repeats (thresholds for alphabet scoring matrices, allowed percentage of mismatches, e-values for Hidden Markov Models, and others). The definition of what does or does not constitute a hit for the model therefore remains subject to some threshold definition.
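
For contrast with the threshold-based approaches, the exact-match idea mentioned in the abstract (stretches of 6 amino acids or longer that recur perfectly in a reference family) can be illustrated with a small sketch. This is not the authors' method or their formal definition of a repetition. It assumes that a "perfect repetition" is simply an exact substring of length at least k that also occurs somewhere in the family, and the function names `family_kmers` and `covered_fraction` are hypothetical.

```python
from typing import Iterable, Set


def family_kmers(family: Iterable[str], k: int = 6) -> Set[str]:
    """Collect every length-k substring seen in the family sequences."""
    kmers: Set[str] = set()
    for seq in family:
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


def covered_fraction(query: str, kmers: Set[str], k: int = 6) -> float:
    """Fraction of query residues inside at least one exact k-mer match.

    Assumption: any exact match of length >= k is a union of overlapping
    exact k-mer matches, so marking k-mer hits covers longer stretches too.
    """
    if not query:
        return 0.0
    covered = [False] * len(query)
    for i in range(len(query) - k + 1):
        if query[i:i + k] in kmers:
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / len(query)


if __name__ == "__main__":
    # Toy sequences, not real protein data.
    family = ["ABCDEFGH", "XXABCDEFYY"]
    query = "ZZABCDEFGZ"
    print(covered_fraction(query, family_kmers(family)))
```

The only parameter is the minimum stretch length k. There is no scoring matrix, mismatch percentage, or e-value to tune. If the query is itself a member of the family, it should be left out when building the k-mer set, otherwise every residue is trivially covered.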

