Abstract
Background
Uncovering the cellular roles of a protein is a task of tremendous importance and complexity that requires dedicated experimental work and, often, sophisticated data mining and processing tools. A protein's functions, often referred to as its annotations, are believed to manifest themselves through the topology of the networks of inter-protein interactions. In particular, there is a growing body of evidence that proteins performing the same function are more likely to interact with each other than with proteins with other functions. However, since functional annotation and protein network topology are often studied separately, the direct relationship between them has not been comprehensively demonstrated. In addition to having general biological significance, such a demonstration would further validate the data extraction and processing methods used to compose protein annotation and protein-protein interaction datasets.
Results
We developed a method for automatic extraction of protein functional annotation from scientific text based on Natural Language Processing (NLP) technology. For the protein annotation extracted from the entire PubMed, we evaluated the precision and recall rates, and compared the performance of the automatic extraction technology to that of the manual curation used in public Gene Ontology (GO) annotation. In the second part of our presentation, we report a large-scale investigation into the correspondence between communities in the literature-based protein networks and GO annotation groups of functionally related proteins. We found a comprehensive two-way match: proteins within biological annotation groups form significantly denser linked network clusters than expected by chance and, conversely, densely linked network communities exhibit a pronounced non-random overlap with GO groups. We also expanded the publicly available GO biological process annotation using the relations extracted by our NLP technology. An increase in the number and size of GO groups without any noticeable decrease of the link density within the groups indicated that this expansion significantly broadens the public GO annotation without diluting its quality. We found that functional GO annotation correlates mostly with clustering in a physical interaction protein network, while its overlap with indirect regulatory network communities is two to three times smaller.
Conclusion
Protein functional annotations extracted by the NLP technology expand and enrich the existing GO annotation system. The GO functional modularity correlates mostly with the clustering in the physical interaction network, suggesting an essential role for the structural organization maintained by these interactions. Reciprocally, clustering of proteins in physical interaction networks can serve as evidence for their functional similarity.
Highlights
Evaluation of protein-Gene Ontology (GO) associations extracted by MedScan technology: The extension of the MedScan natural language processing technology to detect GO terms and protein-GO associations is described in the Methods section and in Additional file 1.
Higher-than-average number of protein interactions within GO annotations: To test the hypothesis that cellular functional modularity is achieved through increased link density in the molecular interaction network, and to further study MedScan extraction accuracy, we investigated whether proteins within a GO group are more likely to interact with each other than with arbitrary network proteins.
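The comparison described in the last highlight can be illustrated with a small sketch. The statistical procedure actually used in the paper is not given in this excerpt, so the code below swaps in a simple random-sampling test: it counts the links inside one annotation group and estimates how often a random protein set of the same size has at least as many. All function names, the toy network, and the toy group are hypothetical and serve only as illustration.

```python
import random


def count_internal_links(group, edges):
    """Count links whose two endpoints both belong to the group."""
    members = set(group)
    return sum(1 for a, b in edges if a in members and b in members)


def link_density(group, edges):
    """Fraction of all possible protein pairs in the group that are linked."""
    n = len(set(group))
    possible = n * (n - 1) // 2
    return count_internal_links(group, edges) / possible if possible else 0.0


def empirical_p_value(group, edges, nodes, n_samples=1000, seed=0):
    """Estimate how often a random set of the same size is at least as
    densely linked as the group (assumption: uniform sampling of proteins
    from the whole network, which ignores node degree)."""
    rng = random.Random(seed)
    size = len(set(group))
    observed = count_internal_links(group, edges)
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(nodes, size)
        if count_internal_links(sample, edges) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_samples + 1)


# Toy data (made up): an undirected network with each link listed once.
toy_nodes = ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"]
toy_edges = [
    ("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4"),
    ("P4", "P5"), ("P6", "P7"),
]
toy_group = ["P1", "P2", "P3", "P4"]  # stands in for one GO group

density = link_density(toy_group, toy_edges)
p_value = empirical_p_value(toy_group, toy_edges, toy_nodes)
```

Uniform sampling is the simplest possible null model. Highly connected proteins inflate the chance of random links, so a degree-preserving randomization of the network would be a stricter baseline.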
Summary
Uncovering the cellular roles of a protein is a task of tremendous importance and complexity that requires dedicated experimental work and, often, sophisticated data mining and processing tools. Since functional annotation and protein network topology are often studied separately, the direct relationship between them has not been comprehensively demonstrated. Numerous attempts to detect modules in biological networks have been described [2,3,4]. In many of these studies, the Gene Ontology [5] (GO) has been used as the "gold standard" to validate the functional relevance of the network clusters found [6,7]. GO is a directed acyclic graph of terms (nodes) connected by links representing two types of term relations: "is-a" and "part-of." GO has three major branches covering the corresponding aspects of protein function: biological process, molecular function, and cellular component. The approaches for assigning GO terms to proteins can be grouped into two major classes