Proteins with related functions are often similar in sequence, reflecting a common phylogenetic origin. Proteins with no known homologs have probably diverged so far from the known sequences in databases that they no longer retain significant similarity to them. All proteins, however, probably share a common ancestry if one moves far enough back in evolution. Given the huge accumulation of protein sequences in current databases, some proteins with no obvious sequence resemblance to any other might therefore be expected to share a few residues that represent footprints of that ancient common ancestry. To identify such putative footprints, we searched for short stretches of amino acids that are present in a given protein sequence and are also found in a significant number of unrelated proteins in the database. A significantly high frequency of occurrence of these "patterns" in the database would support a common evolutionary source, and the diversity of the unrelated proteins that contain a pattern would point to its ancient origin. Using this strategy, significant patterns were found in actual exons, but not in randomized amino acid sequences or in "translated" sequences of noncoding DNA, suggesting that the strategy identifies patterns of biological significance. These significant patterns are not randomly positioned along the sequences analyzed; they tend to accumulate within specific regions, producing a profile of discrete "domains." In some well-known proteins analyzed in this study, some of these domains coincide with known motifs. Thus, the procedure described in this paper could be useful for identifying ancient patterns and domains in protein sequences, some of which could also have functional or structural significance.
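The search strategy outlined above can be illustrated with a minimal sketch. It is not the procedure used in the paper: the abstract specifies neither the pattern length nor the significance test, so the fixed length `k`, the plain `min_hits` threshold, and all function names here are assumptions chosen for illustration. The sketch also does not filter the database for unrelated proteins. It counts how many database sequences contain each short stretch of a query, builds a per-residue profile of the retained patterns, and provides a shuffled copy of the query as a randomized control.

```python
import random
from collections import Counter


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def pattern_hits(query, database, k=4):
    """For each k-mer of the query, count the database sequences containing it."""
    db_counts = Counter()
    for seq in database:
        # A set is used so each sequence contributes at most once per k-mer.
        db_counts.update(kmers(seq, k))
    return {p: db_counts[p] for p in kmers(query, k)}


def frequent_patterns(query, database, k=4, min_hits=2):
    """Keep query k-mers found in at least min_hits database sequences.

    Assumption: this threshold stands in for a real significance test.
    """
    hits = pattern_hits(query, database, k)
    return {p: n for p, n in hits.items() if n >= min_hits}


def position_profile(query, patterns, k=4):
    """Per-residue count of retained patterns overlapping each query position."""
    profile = [0] * len(query)
    for i in range(len(query) - k + 1):
        if query[i:i + k] in patterns:
            for j in range(i, i + k):
                profile[j] += 1
    return profile


def shuffled(seq, seed=0):
    """Return a residue-shuffled copy of seq, for use as a randomized control."""
    rng = random.Random(seed)
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


# Toy data, invented for illustration only.
database = ["MKTAGKSTLLA", "QQGKSTWPV", "LLNGKSTRRE", "AAAAPPPP"]
query = "TTGKSTYY"

patterns = frequent_patterns(query, database, k=4, min_hits=2)
print(patterns)
print(position_profile(query, patterns, k=4))
print(frequent_patterns(shuffled(query), database, k=4, min_hits=2))
```

In this toy case only the stretch `GKST` recurs across the database, and the profile is nonzero only over the residues it covers, which mirrors in miniature the clustering of patterns into discrete regions. A realistic analysis would replace the threshold with a statistical test against the randomized controls.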