Textpresso: an ontology-based information retrieval and extraction system for biological literature.

Hans-Michael Müller, Eimear E Kenny, Paul W Sternberg, Michael Ashburner

doi:10.1371/journal.pbio.0020309


Open Access

https://doi.org/10.1371/journal.pbio.0020309


Abstract

We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators to identify sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase of search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. 
Textpresso is a useful curation tool as well as a search engine for researchers, and it can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at http://www.textpresso.org or via WormBase at http://www.wormbase.org.
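The sentence-level category search described in the abstract can be illustrated with a minimal sketch. This is not Textpresso's implementation: the three-category lexicon, the sample text, and the function names `split_sentences`, `tag_sentence`, and `search` are all hypothetical, and the sentence splitter is deliberately naive. It only shows the idea of marking up sentences with ontology categories and then querying for a combination of categories and keywords within one sentence.

```python
import re

# Toy lexicon (hypothetical entries, for illustration only). The abstract
# reports about 14,500 lexicon entries in 33 categories for the real system.
LEXICON = {
    "gene": ["lin-12", "glp-1", "let-23"],
    "regulation": ["regulates", "represses", "activates"],
    "association": ["binds", "interacts with"],
}

# Made-up sample corpus.
TEXT = (
    "lin-12 regulates glp-1 in the germ line. "
    "The let-23 gene is required for vulval induction. "
    "Worms were grown at 20 degrees."
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_sentence(sentence):
    """Return {category: [matched terms]} for a single sentence."""
    lowered = sentence.lower()
    tags = {}
    for category, terms in LEXICON.items():
        hits = []
        for term in terms:
            # Match whole terms only: no word character or hyphen on either side.
            pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
            if re.search(pattern, lowered):
                hits.append(term)
        if hits:
            tags[category] = hits
    return tags


def search(text, required, keywords=()):
    """Return sentences that contain at least `n` distinct terms of each
    required category and every keyword (case-insensitive substring)."""
    results = []
    for sentence in split_sentences(text):
        tags = tag_sentence(sentence)
        has_categories = all(
            len(tags.get(category, [])) >= n for category, n in required.items()
        )
        has_keywords = all(k.lower() in sentence.lower() for k in keywords)
        if has_categories and has_keywords:
            results.append(sentence)
    return results


# Query in the spirit of "two genes and a regulation term in one sentence".
matches = search(TEXT, {"gene": 2, "regulation": 1})
```

Requiring two distinct `gene` terms plus one `regulation` term in the same sentence mirrors the gene-gene interaction searches mentioned in the abstract; a plain keyword search has no way to express "any gene" without listing every gene name.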

Highlights

Text-mining tools have become indispensable for the biomedical sciences
The labels fall into 33 categories that comprise the Textpresso ontology
We built a database of 3,800 C. elegans papers, bibliographic information from WormBase, abstracts of C. elegans meetings and the Worm Breeder’s Gazette, and some additional links and WormBase entities

Summary

Introduction

Text-mining tools have become indispensable for the biomedical sciences. The increasing wealth of literature in biology and medicine makes it difficult for the researcher to keep up to date with ongoing research. This problem is worsened by the fact that researchers in the biomedical sciences are turning their attention from small-scale projects involving only a few genes or proteins to large-scale projects including genome-wide analyses, making it necessary to capture extended biological networks from literature. Most information of biological discovery is stored in descriptive, full text. Distilling this information from scientific papers manually is expensive and slow, if the full text is available to the researcher at all. We wanted to develop a useful text-mining tool for full-text articles that allows an individual biologist to efficiently locate information of interest.



Journal: PLoS Biology
Publication Date: Sep 21, 2004
Citations: 644
License type: CC BY 4.0
