The BioLexicon: a large-scale terminological resource for biomedical text mining.

Paul Thompson, Simone Marchi, Nicoletta Calzolari, John McNaught, CJ Rupp, Riccardo Del Gratta, Dietrich Rebholz-Schuhmann, Piotr Pezik, Simonetta Montemagni, Monica Monachini, Sophia Ananiadou, Giulia Venturi, Yutaka Sasaki, Valeria Quochi, Vivian Lee

doi:10.1186/1471-2105-12-397

Abstract

Background: Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them search for relevant information. Such systems should account for the multiple written variants used to represent biomedical concepts, and allow the user to search for specific pieces of knowledge (or events) involving these concepts, e.g., protein-protein interactions. Such functionality requires access to detailed information about words used in the biomedical literature. Existing databases and ontologies often have a specific focus and are oriented towards human use. Consequently, biological knowledge is dispersed amongst many resources, which often do not attempt to account for the large and frequently changing set of variants that appear in the literature. Additionally, such resources typically do not provide information about how terms relate to each other in texts to describe events.

Results: This article provides an overview of the design, construction and evaluation of a large-scale lexical and conceptual resource for the biomedical domain, the BioLexicon. The resource can be exploited by text mining tools at several levels, e.g., part-of-speech tagging, recognition of biomedical entities, and the extraction of events in which they are involved. As such, the BioLexicon must account for real usage of words in biomedical texts. In particular, the BioLexicon gathers together different types of terms from several existing data resources into a single, unified repository, and augments them with new term variants automatically extracted from biomedical literature. Extraction of events is facilitated through the inclusion of biologically pertinent verbs (around which events are typically organized) together with information about typical patterns of grammatical and semantic behaviour, which are acquired from domain-specific texts. To foster interoperability, the BioLexicon is modelled using the Lexical Markup Framework, an ISO standard.

Conclusions: The BioLexicon contains over 2.2 M lexical entries and over 1.8 M terminological variants, as well as over 3.3 M semantic relations, including over 2 M synonymy relations. Its exploitation can benefit both application developers and users. We demonstrate some such benefits by describing integration of the resource into a number of different tools, and evaluating improvements in performance that this can bring.

Highlights

Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them search for relevant information.
Whilst the majority of the term variants were extracted from databases and ontologies, 70,105 new variants of gene/protein names were extracted from texts using the text mining techniques described in the Methods section.
The large number of variants that appear in texts but not in existing databases provides evidence of how frequently new term variants are attested in articles. It also indicates that automatic text mining methods such as those described are an essential step for improving the coverage of the BioLexicon and enhancing the performance of text mining systems that make use of it.

Summary

Introduction

Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them search for relevant information. Such systems should account for the multiple written variants used to represent biomedical concepts, and allow the user to search for specific pieces of knowledge (or events) involving these concepts, e.g., protein-protein interactions. Such functionality requires access to detailed information about words used in the biomedical literature. Biological knowledge is dispersed amongst many resources, which often do not attempt to account for the large and frequently changing set of variants that appear in the literature. Such resources typically do not provide information about how terms relate to each other in texts to describe events.

Journal: BMC Bioinformatics
Publication Date: Oct 12, 2011
Citations: 101
License type: CC BY 2.0