Abstract
Combining heterogeneous sources of data is essential for accurate prediction of protein function. The task is complicated by the fact that while sequence-based features can be readily compared across species, most other data are species-specific. In this paper, we present a multi-view extension to GOstruct, a structured-output framework for function annotation of proteins. The extended framework can learn from disparate data sources, with each data source provided to the framework in the form of a kernel. Our empirical results demonstrate that the multi-view framework is able to utilize all available information, yielding better performance than sequence-based models trained across species and models trained from collections of data within a given species. This version of GOstruct participated in the recent Critical Assessment of Functional Annotations (CAFA) challenge; since then we have significantly improved the natural language processing component of the method, which now provides performance that is on par with that provided by sequence information. The GOstruct framework is available for download at http://strut.sourceforge.net.
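The abstract states that each data source is supplied to the framework as a kernel. As a minimal sketch of what working with several kernels can look like, the snippet below cosine-normalizes two toy kernel matrices and adds them element-wise. The unweighted sum is an assumption chosen for illustration, not necessarily how GOstruct combines its views; the function names and kernel values are hypothetical.

```python
import math

def normalize_kernel(K):
    """Cosine-normalize a kernel matrix so that every diagonal entry becomes 1."""
    n = len(K)
    return [[K[i][j] / math.sqrt(K[i][i] * K[j][j]) for j in range(n)]
            for i in range(n)]

def combine_kernels(kernels):
    """Sum cosine-normalized kernels element-wise (assumed unweighted combination)."""
    normalized = [normalize_kernel(K) for K in kernels]
    n = len(normalized[0])
    return [[sum(K[i][j] for K in normalized) for j in range(n)]
            for i in range(n)]

# Toy kernels over the same three proteins; the values are made up.
sequence_kernel = [[4.0, 2.0, 0.0],
                   [2.0, 4.0, 1.0],
                   [0.0, 1.0, 1.0]]
ppi_kernel = [[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.0],
              [0.5, 0.0, 1.0]]

combined = combine_kernels([sequence_kernel, ppi_kernel])
```

Normalizing first keeps a source with large raw values from dominating the sum, and a sum of valid kernels is itself a valid kernel, so the result can be passed to any kernel method.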
Highlights
The availability of a large variety of genomic data relevant to the task of protein function prediction poses a data integration challenge due to the heterogeneity of the data sources
In earlier work we have shown the power of modeling Gene Ontology (GO) term prediction as a hierarchical classification problem using a generalization of the binary SVM to structured output spaces, which allows us to directly predict the GO categories associated with a given
In addition to data that is commonly used in prediction of protein function, namely gene expression and protein-protein interactions (PPI), we report the successful use of large-scale data mined from the biomedical literature, and find that it provides a large boost in accuracy
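The highlights describe GO term prediction as a hierarchical classification problem over a structured output space. The sketch below illustrates only the label representation such a setting typically relies on: a protein's annotations are closed under the ancestor relation and encoded as a binary vector. It does not implement the structured SVM. The term names and the `PARENTS` hierarchy are hypothetical placeholders, not real GO terms.

```python
# Hypothetical toy hierarchy: each term maps to the list of its parent terms.
PARENTS = {
    "root": [],
    "t1": ["root"],
    "t2": ["root"],
    "t1a": ["t1"],
    "t2a": ["t2"],
}

def with_ancestors(terms, parents):
    """Return the given terms together with all of their ancestors."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents[term])
    return closed

def label_vector(terms, parents, order):
    """Encode an annotation set as a 0/1 vector consistent with the hierarchy."""
    closed = with_ancestors(terms, parents)
    return [1 if term in closed else 0 for term in order]

ORDER = sorted(PARENTS)  # fixed term order: root, t1, t1a, t2, t2a
vec = label_vector({"t1a"}, PARENTS, ORDER)
```

Annotating a protein with `t1a` switches on `t1` and `root` as well, so every label vector respects the hierarchy by construction.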
Summary
The availability of a large variety of genomic data relevant to the task of protein function prediction poses a data integration challenge due to the heterogeneity of the data sources. In addition to data that is commonly used in prediction of protein function, namely gene expression and protein-protein interactions (PPI), we report the successful use of large-scale data mined from the biomedical literature, and find that it provides a large boost in accuracy. Together with the text mining data, features based on sequence similarity and PPI account for most of the predictor performance. We examined the tasks of predicting molecular function, biological process and cellular component in isolation. GO terms belong to three namespaces that describe a gene product's function: its function on the molecular level, the biological processes in which it participates, and its localization to a cellular component. A number of methods employ sequence and structural similarity to make functional annotation predictions with varying degrees of accuracy [11,12,13,14,15]. New schemes are still being proposed; one example is the algorithm by Hamp et al. that was used in the 2011 CAFA challenge [12].