Because proteins that have diverged beyond significant sequence similarity still retain the three-dimensional (3D) fold of their ancestor (Chothia and Lesk, 1986; Rost, 1997), the recognition of structural similarity between proteins provides powerful clues to ancestry. In fact, a large number of distant homology relationships were identified only after the structures of the proteins had been solved (Murzin, 1998). However, structures have been determined for only a small fraction of proteins. There is therefore a pressing need to improve the performance of sequence-based methods for detecting proteins with the same fold but scant sequence similarity. Here, we examine how to achieve this goal by combining three kinds of information from a protein sequence. First, it has long been recognized that the use of multiply aligned sequences from a protein family improves the sensitivity of homology detection. This idea is used by many recent computational procedures that exploit evolutionary information to uncover subtle sequence similarity. Examples of such procedures include sequence profiles (Gribskov et al., 1987), consensus templates or motifs (Taylor, 1986; Bairoch, 1991; Tatusov et al., 1994; Yi and Lander, 1994), position-specific scoring matrices (PSSMs) (Henikoff and Henikoff, 1997), profile hidden Markov models (Eddy, 1998), and intermediate sequence methods (Holm and Sander, 1997; Neuwald et al., 1997; Park et al., 1997). PSI-BLAST (Altschul et al., 1997), one of the most widely used of these procedures, employs an iterative profile search strategy that combines the advantages of both PSSM and intermediate sequence methods. This program has been used effectively by several groups to assign 3D folds to predicted genome products (Teichmann et al., 1999). Second, proteins with the same fold have, by definition, very similar secondary structures. 
In light of the improved accuracy of secondary structure prediction (Rost and Sander, 1993), several groups have attempted to use sequence-derived predictions to improve the sensitivity of fold recognition (Fischer and Eisenberg, 1996; Russel et al., 1996; Di Francesco et al., 1997; Rice and Eisenberg, 1997; Rost et al., 1997). These methods usually represent each protein in a template library by a one-dimensional (1D) string of symbols (a profile), each symbol representing a distinctive 3D structural state, and then use dynamic programming (Needleman and Wunsch, 1970) to align the predicted structural profiles of the query
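
As an illustration of the dynamic-programming step, the following is a minimal sketch of a global (Needleman and Wunsch style) alignment between two 1D strings of structural states. The three-letter alphabet (H for helix, E for strand, C for coil), the function name `align_states`, and the match, mismatch, and gap scores are all assumptions chosen for illustration; they are not the scoring scheme of any of the methods cited above.

```python
def align_states(query, template, match=1, mismatch=-1, gap=-1):
    """Globally align two strings of structural-state symbols.

    Returns (score, aligned_query, aligned_template). The scoring values
    are illustrative assumptions, not taken from the cited methods.
    """
    n, m = len(query), len(template)

    # score[i][j] = best score for aligning query[:i] with template[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if query[i - 1] == template[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # align the two symbols
                score[i - 1][j] + gap,    # gap in the template
                score[i][j - 1] + gap,    # gap in the query
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_q, aligned_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if query[i - 1] == template[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                aligned_q.append(query[i - 1])
                aligned_t.append(template[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_q.append(query[i - 1])
            aligned_t.append("-")
            i -= 1
        else:
            aligned_q.append("-")
            aligned_t.append(template[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(aligned_q)), "".join(reversed(aligned_t))


if __name__ == "__main__":
    # Hypothetical predicted states for a query and one library template.
    print(align_states("CHHHHCCEEEC", "CHHHCCEEEEC"))
```

In a fold-recognition setting, the query would be scored against every template string in the library and the templates ranked by score; a practical method would likely replace the flat match and mismatch values with a state-substitution table and position-dependent gap penalties.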