A Maximum-Entropy approach for accurate document annotation in the biomedical domain

George Tsatsaronis, Heiko Dietze, Natalia Macari, Michael Schroeder, Sunna Torge

doi:10.1186/2041-1480-3-s1-s2

Open Access

https://doi.org/10.1186/2041-1480-3-s1-s2

Abstract

The growing volume of scientific literature on the Web and the absence of efficient tools for classifying and searching documents are the two most important factors that influence the speed of search and the quality of the results. Previous studies have shown that ontologies make it possible to process document and query information at the semantic level, which greatly improves the search for relevant information and is a further step towards the Semantic Web. A fundamental step in these approaches is the annotation of documents with ontology concepts, which can also be seen as a classification task. In this paper we address this issue for the biomedical domain and present a new automated and robust method, based on a Maximum Entropy approach, for annotating biomedical literature documents with terms from the Medical Subject Headings (MeSH).

The experimental evaluation shows that the suggested Maximum Entropy approach for annotating biomedical documents with MeSH terms is highly accurate, robust to term ambiguity, and performs very well even when only a very small number of training documents is used. More precisely, we show that the proposed algorithm obtained an average F-measure of 92.4% (precision 99.41%, recall 86.77%) over the full range of explored terms (4,078 MeSH terms), and that its performance is resilient to term ambiguity, achieving an average F-measure of 92.42% (precision 99.32%, recall 86.87%) on the explored MeSH terms that were found to be ambiguous according to the Unified Medical Language System (UMLS) thesaurus. Finally, we compared the suggested methodology with a Naive Bayes and a Decision Trees classification approach, and we show that the Maximum Entropy based approach achieved a higher F-measure for both ambiguous and monosemous MeSH terms.
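As a reading aid for the figures above, the following minimal Python sketch computes the standard balanced F-measure (the harmonic mean of precision and recall). The function name `f_measure` is an illustrative choice, and the sketch assumes the abstract uses this standard definition. Applying the formula directly to the averaged precision and recall gives roughly 92.66%, slightly above the reported average F-measure of 92.4%. A plausible explanation, though the abstract does not state it, is that the reported value averages the F-measures of the individual MeSH terms, which generally differs from the F-measure of the averaged precision and recall.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# Averaged precision and recall reported in the abstract for all 4,078 terms.
f_of_averages = f_measure(0.9941, 0.8677)
print(f"F-measure of the averaged P and R: {f_of_averages:.4f}")  # about 0.9266
```

The small gap between this value and the reported 92.4% is expected whenever per-term scores are averaged, so it should not be read as an inconsistency in the paper.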

Highlights

Introduction and motivation: With the rapid expansion of the internet as a means of retrieving related scientific and educational literature, the search for relevant information has become a difficult and time-consuming process.
Some representative examples of such search engines for the biomedical domain are: (a) GoPubMed, which uses the Gene Ontology (GO) and the Medical Subject Headings (MeSH) as background knowledge for indexing the biomedical literature stored in the PubMed database, and various text mining techniques and algorithms for the identification of relevant ontology entities in PubMed abstracts; (b) semedico, which provides access to semantic metadata about abstracts indexed in PubMed using the JULIE Lab text mining engine and MeSH as a knowledge base; and (c) novoseek, which uses externally available data and contextual term information to identify key biomedical terms in biomedical literature documents.
Overview of the suggested approach and summary of the results: Taking into consideration the findings of the previous sections, in this work we present a novel approach based on Maximum Entropy that can annotate biomedical literature documents with MeSH terms automatically and with very high F-measure.

Summary

Introduction

Introduction and motivation

With the rapid expansion of the internet as a means of retrieving related scientific and educational literature, the search for relevant information has become a difficult and time-consuming process. [...] Nlm.nih.gov/pubmed/), and various text mining techniques and algorithms (stemming, tokenization, synonym detection) for the identification of relevant ontology entities in PubMed abstracts, (b) semedico (http://www.semedico.org), which provides access to semantic metadata about abstracts indexed in PubMed using the JULIE Lab text mining engine (http://www.julielab.de) and MeSH as a knowledge base, and (c) novoseek (http://www.novoseek.com), which uses externally available data and contextual term information to identify key biomedical terms in biomedical literature documents.

In the section Approach, we explain in detail how the suggested Maximum Entropy-based approach operates and why we selected it over other learning alternatives. The main reasons are its robustness and the fact that it exposes the importance of the features that are most representative of each learned class; this latter property gives interpretability to the learned models. A formulation of the problem is presented, as well as a summary of the suggested approach and of the reported results.
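To illustrate the interpretability property mentioned above, here is a minimal Python sketch of a two-class maximum-entropy model, which is equivalent to logistic regression, trained per term on bag-of-words features. It is not the authors' implementation: the feature extraction, the training procedure (plain stochastic gradient ascent with L2 regularization), the hyperparameters, the four toy documents, and the choice of a single hypothetical target term are all assumptions made for illustration.

```python
import math
from collections import Counter


def featurize(text):
    """Bag-of-words counts (an assumed, simplistic feature set)."""
    return Counter(text.lower().split())


def train_maxent(docs, labels, epochs=200, lr=0.5, l2=0.01):
    """Two-class maximum-entropy (logistic regression) model trained
    by stochastic gradient ascent on the L2-regularized log-likelihood."""
    weights, bias = {}, 0.0
    feats = [featurize(d) for d in docs]
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            z = bias + sum(weights.get(f, 0.0) * v for f, v in x.items())
            p = 1.0 / (1.0 + math.exp(-z))  # P(term applies | document)
            err = y - p
            bias += lr * err
            for f, v in x.items():
                w = weights.get(f, 0.0)
                weights[f] = w + lr * (err * v - l2 * w)
    return weights, bias


def predict_proba(model, text):
    """Probability that the target term applies to the given text."""
    weights, bias = model
    z = bias + sum(weights.get(f, 0.0) * v for f, v in featurize(text).items())
    return 1.0 / (1.0 + math.exp(-z))


# Toy, made-up documents; label 1 means a hypothetical diabetes-related
# MeSH term should be assigned, label 0 means it should not.
docs = [
    "insulin regulates glucose in diabetes patients",
    "glucose levels and insulin therapy",
    "protein folding in yeast cells",
    "yeast gene expression analysis",
]
labels = [1, 1, 0, 0]
model = train_maxent(docs, labels)

# The learned weights indicate which features drive each decision.
weights, _ = model
for feature, w in sorted(weights.items(), key=lambda kv: -abs(kv[1]))[:5]:
    print(f"{feature:10s} {w:+.2f}")
```

In this sketch, words that occur only in the positive examples receive positive weights and words that occur only in the negative examples receive negative weights, which is the kind of per-class feature importance that the passage describes as making the learned models interpretable.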

Methods

Results

Conclusion

Journal: Journal of Biomedical Semantics	Publication Date: Jan 1, 2012
Citations: 22	License type: cc-by
