Abstract
Background: Sequence matching is extremely important for applications throughout biology, particularly for discovering information such as functional and evolutionary relationships, and also for discriminating between unimportant and disease mutants. At present the functions of a large fraction of genes are unknown; improvements in sequence matching will improve gene annotations. Universal amino acid substitution matrices such as Blosum62 are used to measure sequence similarities and to identify distant homologues, regardless of the structure class. However, such single matrices do not take into account important structural information evident within the different topologies of proteins, and they treat substitutions within all protein folds identically. Others have suggested that the use of structural information can lead to significant improvements in sequence matching, but this has not yet been very effective. Here we develop novel substitution matrices that include not only general sequence information but also a topology-specific component that is unique for each CATH topology. This novel feature of using a combination of sequence and structure information for each protein topology significantly improves the sequence matching scores for the sequence pairs tested. We have used a novel multi-structure alignment method for each homology level of CATH in order to extract topological information.

Results: We obtain statistically significant improvements in sequence matching scores for 73 % of the alpha helical test cases. On average, 61 % of the test cases showed improvements in homology detection when structure information was incorporated into the substitution matrices. On average, z-scores for homology detection are improved by more than 54 % for all cases, and some individual cases have z-scores more than twice those obtained using generic matrices. Our topology-specific similarity matrices also outperform other traditional similarity matrices and single-matrix-based structure methods. When the default amino acid substitution matrix in the Psi-blast algorithm is replaced by our structure-based matrices, the structure matching is significantly improved over conventional Psi-blast. It also outperforms the results obtained for the corresponding HMM profiles generated for each topology.

Conclusions: We show that by incorporating topology-specific structure information in addition to sequence information into specific amino acid substitution matrices, the sequence matching scores and homology detection are significantly improved. Our topology-specific similarity matrices outperform other traditional similarity matrices and single-matrix-based structure methods, and also show improvement over conventional Psi-blast and HMM-profile-based methods in sequence matching. The results support the discriminatory ability of the new amino acid similarity matrices to distinguish between distant homologs and structurally dissimilar pairs.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1198-z) contains supplementary material, which is available to authorized users.
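The abstract reports homology detection in terms of z-scores but does not spell out how they are computed. One common approach, sketched here purely as an illustration, compares the alignment score of a sequence pair with the scores obtained after shuffling one of the sequences. The function names, the linear gap penalty, the number of shuffles, and the toy scoring function are all assumptions made for this sketch, not details taken from the paper.

```python
import random
import statistics


def global_score(a, b, score, gap=-4):
    """Needleman-Wunsch alignment score with a linear gap penalty (no traceback)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]
        for j, y in enumerate(b, start=1):
            curr.append(max(prev[j - 1] + score(x, y),  # align x with y
                            prev[j] + gap,              # gap in b
                            curr[j - 1] + gap))         # gap in a
        prev = curr
    return prev[-1]


def z_score(a, b, score, n_shuffles=200, seed=0):
    """Z-score of the observed score against scores for shuffled copies of b."""
    rng = random.Random(seed)
    observed = global_score(a, b, score)
    letters = list(b)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys residue order
        background.append(global_score(a, "".join(letters), score))
    mu = statistics.mean(background)
    sigma = statistics.pstdev(background)
    return (observed - mu) / sigma if sigma > 0 else float("nan")


def toy_score(x, y):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 2 if x == y else -1
```

Because `score` is passed in as a function, the same pair can be scored once with a generic matrix and once with a topology-specific one, and the two z-scores compared directly.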
Highlights
Sequence matching is extremely important for applications throughout biology, for discovering information such as functional and evolutionary relationships, and for discriminating between unimportant and disease mutants
Standard matrices like Blosum62 have been generated without taking into account any topological information, even though the statistics of amino acid substitutions vary with protein topology
We have structurally aligned the proteins belonging to each CATH topology and used these alignments to develop a similarity matrix for each topology, making amino acid substitution assignments directly from the structure alignments
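The idea of reading substitutions directly off an alignment can be sketched as a log-odds calculation over aligned columns. This is a minimal illustration in the style of Blosum-like matrices, not the authors' exact procedure: the function name, the pseudocount, the base-2 logarithm, and the input format (equal-length rows with `-` for gaps, assumed to come from a structure alignment of one topology) are assumptions made here.

```python
import math
from collections import Counter
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds_matrix(alignment, pseudocount=1.0, alphabet=AMINO_ACIDS):
    """Build a symmetric log-odds matrix from aligned rows of equal length."""
    pair_counts = Counter()
    for column in zip(*alignment):
        # Ignore gaps and any non-standard symbols in this column.
        residues = [r for r in column if r in alphabet]
        for a, b in combinations(residues, 2):
            # Count each observed pair in both orders so the matrix is symmetric.
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1

    n = len(alphabet)
    total = sum(pair_counts.values()) + pseudocount * n * n
    # Observed pair frequencies, smoothed so unseen pairs stay finite.
    q = {(a, b): (pair_counts[(a, b)] + pseudocount) / total
         for a in alphabet for b in alphabet}
    # Background frequency of each residue, taken as the marginal of q.
    p = {a: sum(q[(a, b)] for b in alphabet) for a in alphabet}
    # Log-odds: how much more often a pair is aligned than expected by chance.
    return {(a, b): math.log2(q[(a, b)] / (p[a] * p[b]))
            for a in alphabet for b in alphabet}
```

Running this once per topology, on that topology's alignments only, would give one matrix per topology. How such a matrix is combined with the general sequence component is not described in this section and is left out of the sketch.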
Summary
Sequence matching is extremely important for applications throughout biology, for discovering information such as functional and evolutionary relationships, and for discriminating between unimportant and disease mutants. Universal amino acid substitution matrices such as Blosum are used to measure sequence similarities and to identify distant homologues, regardless of the structure class. Such single matrices do not take into account important structural information evident within the different topologies of proteins, and they treat substitutions within all protein folds identically. We develop novel substitution matrices that include general sequence information and have a topology-specific component that is unique for each CATH topology. This novel feature of using a combination of sequence and structure information for each protein topology significantly improves the sequence matching scores for the sequence pairs tested. Improving protein sequence matching should improve both the identification of remote homologs and the prediction of the structures and functions of large numbers of protein sequences.