Introduction Computational methods for evaluating protein structure have deepened our understanding of how aligned amino acid sequences can be used to infer structural relationships. We developed a pattern discovery program that predicts and visualizes higher-order group relationships and their correspondences from large-scale multiple sequence alignments across multiple species. The program can enhance the understanding of submolecular protein structure, even for proteins that do not have a solved structure.
Methods We developed the PSICalc (Protein Sequence Interdependency Calculator) software, which uses a novel algorithm, based originally on the k-modes algorithm, that clusters substructural components using normalized mutual information as a measure of inter-site interdependency [Durston, K.K., Chiu, D.K.Y., Wong, A.K.C., Li, G.C.L. Statistical discovery of site inter-dependencies in sub-molecular hierarchical protein structuring. EURASIP J Bioinform Syst Biol. 2012 Jul 13;2012(1):8. doi: 10.1186/1687-4153-2012-8]. To aid in visualizing interactions across multiple subdomains, we developed PSICalc Viewer, a complete graphical user interface that lets researchers supply a multiple sequence alignment and a set of input parameters and view the predicted protein structure as a polytree graph.
Results The structure of ubiquitin is well known and served as a base case for validating the amino acid pairs and groupings identified by the algorithm. The algorithm correctly grouped amino acids into pairs and clusters representing features of the three-dimensional structure of ubiquitin. As one example, it clustered amino acids 27, 41, and 52, which occupy adjacent positions in the three-dimensional structure of ubiquitin (1UBQ). The algorithm also identified pairs and clusters of amino acids around the ATP-binding pocket of the ATPase domain of topoisomerase II, using a dataset of HATPase domains from the Pfam database.
Conclusions The software tool employs the PSICalc algorithm, which allows researchers to predict features of submolecular protein structure. The pairs and clusters of amino acids identified by the algorithm appear to have structural relevance, either through close interactions or through mutual participation in larger folds or domains. Additional studies are under way to apply the tool to other proteins of interest. We expect this tool to aid the examination of proteins of unknown structure by identifying submolecular interdependencies in protein sequences.
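The interdependency measure named in Methods, normalized mutual information between alignment columns, can be illustrated with a minimal sketch. This is not the PSICalc implementation: the abstract does not state which normalization is used, so the sketch assumes one common choice, mutual information divided by joint entropy. The function names and the toy alignment are hypothetical and serve only as illustration.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of hashable symbols."""
    total = len(symbols)
    counts = Counter(symbols)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def normalized_mutual_information(col_x, col_y):
    """Return I(X;Y) / H(X,Y) for two equally long alignment columns.

    Assumption: normalization by joint entropy. Other normalizations
    exist and the source does not say which one PSICalc applies.
    """
    if len(col_x) != len(col_y):
        raise ValueError("columns must have the same length")
    h_x = entropy(col_x)
    h_y = entropy(col_y)
    h_xy = entropy(list(zip(col_x, col_y)))  # joint entropy over residue pairs
    if h_xy == 0:
        return 0.0  # both columns fully conserved: no covariation signal
    return (h_x + h_y - h_xy) / h_xy


def column(alignment, index):
    """Extract one column (0-based) from a list of aligned sequences."""
    return [seq[index] for seq in alignment]


# Toy alignment: hypothetical data, not taken from the paper.
toy_msa = [
    "AKLV",
    "AKLI",
    "GRLV",
    "GRLI",
]

covarying = normalized_mutual_information(column(toy_msa, 0), column(toy_msa, 1))
independent = normalized_mutual_information(column(toy_msa, 0), column(toy_msa, 3))
print(covarying, independent)
```

In the toy alignment, columns 0 and 1 change together in every sequence and score 1.0, while columns 0 and 3 vary independently and score 0.0. A clustering step such as the one the abstract describes would then group sites whose pairwise scores are high.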