Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation

Kimberly Van Auken, Joshua Jaffery, Paul W Sternberg, Hans-Michael Müller, Juancarlos Chan

doi:10.1186/1471-2105-10-228

Abstract

Background: Manual curation of experimental data from the biomedical literature is an expensive and time-consuming endeavor. Nevertheless, most biological knowledge bases still rely heavily on manual curation for data extraction and entry. Text mining software that can semi- or fully automate information retrieval from the literature would thus provide a significant boost to manual curation efforts.

Results: We employ the Textpresso category-based information retrieval and extraction system, developed by WormBase, to explore how Textpresso might improve the efficiency with which we manually curate C. elegans proteins to the Gene Ontology's Cellular Component Ontology. Using a training set of sentences that describe results of localization experiments in the published literature, we generated three new curation task-specific categories (Cellular Components, Assay Terms, and Verbs) containing words and phrases associated with reports of experimentally determined subcellular localization. We compared the results of manual curation to those of Textpresso queries that searched the full text of articles for sentences containing terms from each of the three new categories plus the name of a previously uncurated C. elegans protein. Textpresso searches identified curatable papers with recall and precision rates of 79.1% and 61.8%, respectively (F-score of 69.5%), when compared to manual curation. Within those documents, Textpresso identified relevant sentences with recall and precision rates of 30.3% and 80.1% (F-score of 44.0%). From returned sentences, curators were able to make 66.2% of all possible experimentally supported GO Cellular Component annotations with 97.3% precision (F-score of 78.8%). Measuring the relative efficiencies of Textpresso-based versus manual curation, we find that Textpresso has the potential to increase curation efficiency by at least 8-fold, and perhaps as much as 15-fold, given differences in individual curatorial speed.

Conclusion: Textpresso is an effective tool for improving the efficiency of manual, experimentally based curation. Incorporating a Textpresso-based Cellular Component curation pipeline at WormBase has allowed us to transition from strictly manual curation of this data type to a more efficient pipeline of computer-assisted validation. Continued development of curation task-specific Textpresso categories will provide an invaluable resource for genomics databases that rely heavily on manual curation.
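The F-scores quoted in the abstract can be checked against the stated recall and precision values. The sketch below assumes the balanced F-score (the harmonic mean of recall and precision), which the abstract does not state explicitly; the function name `f_score` is our own.

```python
def f_score(recall: float, precision: float) -> float:
    """Balanced F-score: harmonic mean of recall and precision (same units as inputs)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall %, precision %, reported F-score %) as given in the abstract
reported = {
    "curatable papers": (79.1, 61.8, 69.5),
    "relevant sentences": (30.3, 80.1, 44.0),
    "annotations from returned sentences": (66.2, 97.3, 78.8),
}

for label, (recall, precision, f_reported) in reported.items():
    computed = f_score(recall, precision)
    print(f"{label}: computed {computed:.1f}%, reported {f_reported}%")
```

Under this assumption the computed values agree with the reported ones to within about 0.1 percentage point. A small difference of that size would be expected if the reported F-scores were calculated from unrounded recall and precision.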

Highlights

Manual curation of experimental data from the biomedical literature is an expensive and time-consuming endeavor
Words and phrases identified by our word usage and frequency analysis were manually sorted into three categories: Cellular Components, Assay Terms, and Verbs, and included terms such as: nucleus, cell body, centrosomal; expression, antibody, throughout; and detect, exhibited, revealed, respectively
The vast majority of experimental results entered into model organism databases such as WormBase are entered manually by curators who need to identify appropriate papers, read the full text, evaluate the information, and enter annotations using curation tools
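The kind of sentence-level query described here (a protein name plus at least one term from each of the three categories) can be illustrated with a toy filter. This is not Textpresso's implementation: the category lists contain only the example terms quoted above, the protein name `ABC-1` and the test sentences are hypothetical, and matching is plain whole-word search without handling of inflected forms.

```python
import re

# Example terms quoted in the text; the actual categories are larger.
CATEGORIES = {
    "cellular_component": ["nucleus", "cell body", "centrosomal"],
    "assay_term": ["expression", "antibody", "throughout"],
    "verb": ["detect", "exhibited", "revealed"],
}


def contains_term(sentence: str, terms: list[str]) -> bool:
    """True if any term occurs in the sentence as a whole word or phrase."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", sentence, re.IGNORECASE)
        for term in terms
    )


def is_candidate(sentence: str, protein: str) -> bool:
    """Require the protein name plus at least one term from every category."""
    if not contains_term(sentence, [protein]):
        return False
    return all(contains_term(sentence, terms) for terms in CATEGORIES.values())


# Hypothetical sentences for illustration only.
sentences = [
    "Antibody staining revealed that ABC-1 localizes to the nucleus.",
    "ABC-1 mutants are viable and fertile.",
]
hits = [s for s in sentences if is_candidate(s, "ABC-1")]
print(hits)
```

Only the first sentence passes, because the second names the protein but contains no Cellular Component, Assay Term, or Verb entry. Requiring all four elements in a single sentence is a strict criterion, which fits the pattern reported in the abstract of high precision but lower recall at the sentence level.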

Summary

Introduction

Manual curation of experimental data from the biomedical literature is an expensive and time-consuming endeavor. For organisms with smaller research communities, functional annotations may initially derive largely from computational or comparative methods which, in turn, can rely heavily upon the accuracy and completeness of model organism genome curation for providing suitable reference annotations and training sets [3,4,5]. Divided into three distinct ontologies that describe Biological Processes, Molecular Functions, and Cellular Components, the Gene Ontology (GO) is used by database curators to record key biological features of a gene product in language that is both human-readable and computationally amenable. For example, annotation of a gene product to the Biological Process term cell division (GO:0051301) based upon a mutant phenotype that results in arrested cell division would use the Inferred from Mutant Phenotype (IMP) evidence code. Similarly, annotation of a gene product to the Cellular Component term plasma membrane (GO:0005886) based upon immunofluorescence experiments would use the Inferred from Direct Assay (IDA) evidence code. There is a growing need for semi- or fully-automated GO curation strategies that will help database curators rapidly and accurately identify key experimental results in the full text of research articles.
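The two example annotations above can be written down as simple records to show what a curator captures: a gene product, a GO term, the ontology it belongs to, and an evidence code. The sketch below is illustrative only. The class `GOAnnotation`, its field names, and the gene product names `abc-1` and `xyz-2` are hypothetical, and this is not the GO annotation file format.

```python
from dataclasses import dataclass

# Evidence codes mentioned in the text (the GO defines others as well).
EVIDENCE_CODES = {
    "IMP": "Inferred from Mutant Phenotype",
    "IDA": "Inferred from Direct Assay",
}


@dataclass(frozen=True)
class GOAnnotation:
    gene_product: str   # hypothetical placeholder names are used below
    go_id: str
    go_term: str
    ontology: str       # Biological Process, Molecular Function, or Cellular Component
    evidence_code: str

    def __post_init__(self) -> None:
        # Reject evidence codes this sketch does not know about.
        if self.evidence_code not in EVIDENCE_CODES:
            raise ValueError(f"unknown evidence code: {self.evidence_code}")


annotations = [
    GOAnnotation("abc-1", "GO:0051301", "cell division", "Biological Process", "IMP"),
    GOAnnotation("xyz-2", "GO:0005886", "plasma membrane", "Cellular Component", "IDA"),
]

for a in annotations:
    print(f"{a.gene_product}: {a.go_term} ({a.go_id}), {EVIDENCE_CODES[a.evidence_code]}")
```

Keeping the evidence code as a validated field reflects the point made in the text: the same GO term can be supported by different kinds of experiment, and the code records which kind was used.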

Methods

Results

Conclusion