Abstract

Finding the correct sense of a polysemous word remains one of the greatest challenges in representing biomedical text. Word Sense Disambiguation (WSD) algorithms mostly rely on an external source of knowledge, such as a thesaurus or ontology, to automatically select the proper concept for an ambiguous term within a given context window, using semantic similarity and relatedness measures. In this paper, we propose a Web-based kernel function that measures the semantic relatedness between concepts in order to disambiguate an expression against multiple candidate concepts. This measure uses the large volume of documents returned by the PubMed search engine to determine the wider context of a short biomedical text, through a new term weighting scheme based on Rough Set Theory (RST). To illustrate the efficiency of the proposed method, we evaluate a WSD algorithm based on this measure on a biomedical dataset (MSH-WSD) that contains 203 ambiguous terms and acronyms. The results show promising improvements.

Highlights

  • The amount of information in the medical field has grown exponentially, with more than 23 million published citations listed in the MedLine database and available via PubMed

  • The proposed measure uses the large volume of documents returned by the PubMed search engine to determine the wider context of a short biomedical text, through a new term weighting scheme based on Rough Set Theory (RST)

  • We compute the similarity between the context of the word to be mapped and each of its candidate concepts; the concept with the greatest similarity is chosen. The term weighting scheme is based on Rough Set Theory (RST), a mathematical tool for dealing with vagueness and uncertainty (Pawlak, 1991)
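
The selection step described in the last highlight can be sketched as follows. This is a minimal illustration only: it replaces the paper's Web-based kernel and RST term weighting with plain cosine similarity over raw term counts, and the function names, concept labels and concept descriptions are hypothetical.

```python
import math
from collections import Counter
from typing import Dict


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(context: str, concept_profiles: Dict[str, str]) -> str:
    """Return the candidate concept whose profile is most similar to the context.

    Assumption: each concept is described by a short text profile. In the
    paper the profile would come from PubMed results with RST-based weights;
    here it is just whitespace-tokenised text with raw counts.
    """
    ctx = Counter(context.lower().split())
    scores = {
        concept: cosine(ctx, Counter(profile.lower().split()))
        for concept, profile in concept_profiles.items()
    }
    return max(scores, key=scores.get)


# Hypothetical candidate concepts for the ambiguous term "cold".
profiles = {
    "common_cold": "viral infection cough fever sore throat",
    "cold_temperature": "low temperature exposure hypothermia freezing",
}
best = disambiguate("patient presented with cold cough and fever", profiles)
print(best)  # common_cold
```

Any other similarity function can be substituted for `cosine` without changing the selection logic, which is simply an argmax over the candidate concepts.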

Summary

Introduction

The amount of information in the medical field has grown exponentially, with more than 23 million published citations listed in the MedLine database and available via PubMed. Many biomedical text-mining applications, such as information retrieval, text categorization and machine translation, aim to provide suitable solutions for handling this volume. One of the major problems in these applications is document representation, which is still limited to the terms or words that occur in the document. The usual way of representing a text is the Bag of Words (BoW) representation, which looks only at surface word forms and ignores all semantic or conceptual information in the text. Biomedical documents make these issues even more serious, due to their sparseness and lexical ambiguity.
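
To make the limitation of BoW concrete, here is a small sketch under simple assumptions (whitespace tokenisation, lowercasing, made-up example sentences; `bag_of_words` is a hypothetical helper). Two sentences that use "cold" in different senses end up sharing the same feature, because only the surface form is recorded.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Map a text to raw term counts, ignoring word order and word sense."""
    return Counter(text.lower().split())


# "cold" as an illness versus "cold" as a temperature.
illness = bag_of_words("the patient caught a cold")
temperature = bag_of_words("cold exposure caused frostbite")

# The only overlapping feature is the ambiguous surface form itself.
shared = set(illness) & set(temperature)
print(shared)  # {'cold'}
```

A concept-based representation would map the two occurrences to different concepts, which is the role of the disambiguation step.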
