Exploring representations of protein structure for automated remote homology detection and mapping of protein structure space.

Kevin Molloy, M Jennifer Van, Amarda Shehu, Daniel Barbara

doi:10.1186/1471-2105-15-s8-s4

Abstract

Background: Due to rapid sequencing of genomes, there are now millions of deposited protein sequences with no known function. Fast sequence-based comparisons detect close homologs of a protein of interest, so that functional information can be transferred from the homologs to that protein. Sequence-based comparison cannot detect remote homologs, in which evolution has adjusted the sequence while largely preserving structure. Structure-based comparisons can detect remote homologs, but most methods for doing so are too expensive to apply at a large scale over structural databases of proteins. Recently, fragment-based structural representations have been proposed that allow fast detection of remote homologs with reasonable accuracy. These representations have also been used to obtain linearly-reducible maps of protein structure space. It has been shown, and is further supported by the analysis in this paper, that such maps preserve functional co-localization in protein structure space.

Methods: Inspired by a recent application of the Latent Dirichlet Allocation (LDA) model to structural comparison of proteins, we propose higher-order, LDA-obtained topic-based representations of protein structures as an alternative route to remote homology detection and to organizing protein structure space in a few dimensions. Various techniques based on natural language processing are proposed and employed to aid the analysis of topics in the protein structure domain.

Results: We show that a topic-based representation is just as effective as a fragment-based one at automated detection of remote homologs and organization of protein structure space. We conduct a detailed analysis of the information content in the topic-based representation, showing that topics have semantic meaning. The fragment-based and topic-based representations are also shown to allow prediction of superfamily membership.

Conclusions: This work opens exciting avenues for designing novel representations that extract information about protein structures, as well as for organizing and mining protein structure space with mature text mining tools.

Highlights

Due to rapid sequencing of genomes, there are millions of deposited protein sequences with no known function
We investigate a topic-based representation obtained through application of the Latent Dirichlet Allocation (LDA) model
We also apply Principal Component Analysis (PCA) to visualize co-localization of function in protein structure space, and qualitatively compare the results with the organization obtained through the topic-based representation investigated in this paper

Summary

Introduction

Due to rapid sequencing of genomes, there are millions of deposited protein sequences with no known function. Fast sequence-based comparisons detect close homologs of a protein of interest, so that functional information can be transferred from the homologs to that protein. Sequence-based comparison cannot, however, detect remote homologs, in which evolution has adjusted the sequence while largely preserving structure. More generally, sequence-based function inference may miss similar proteins where either early branching points (in which case the proteins are referred to as remote homologs) or convergent evolution has resulted in high sequence divergence while largely preserving structure and function. About 25% of all sequenced proteins are estimated to fall into this category. Fragment-based structural representations have been proposed that allow fast detection of remote homologs with reasonable accuracy, and these representations have also been used to obtain linearly-reducible maps of protein structure space.
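
To make the idea of a fast fragment-based comparison concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each protein has already been reduced to a list of indices into a fragment library, that a protein is then represented by its vector of fragment counts, and that two proteins are compared by cosine similarity. The library size, the protein names, the fragment assignments, and the choice of cosine similarity are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def bag_of_fragments(fragment_ids, library_size):
    """Return a count vector: entry i is how often library fragment i occurs."""
    counts = Counter(fragment_ids)
    return [counts.get(i, 0) for i in range(library_size)]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data: a 6-fragment library and made-up fragment assignments.
LIBRARY_SIZE = 6
query = bag_of_fragments([0, 0, 1, 3, 3, 3], LIBRARY_SIZE)
database = {
    "protein_a": bag_of_fragments([0, 1, 3, 3, 3, 3], LIBRARY_SIZE),
    "protein_b": bag_of_fragments([2, 2, 4, 5, 5, 5], LIBRARY_SIZE),
}

# Rank database entries by similarity to the query, most similar first.
ranked = sorted(
    database,
    key=lambda name: cosine_similarity(query, database[name]),
    reverse=True,
)
print(ranked)
```

Because each comparison is a single pass over two fixed-length vectors, a scan of this kind scales linearly with database size, which is the property that makes such representations attractive relative to pairwise structural alignment.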

Methods

Results

Conclusion