Abstract

BackgroundWe aim to solve the problem of determining word senses for ambiguous biomedical terms with minimal human effort.MethodsWe build a fully automated system for Word Sense Disambiguation by designing a system that does not require manually-constructed external resources or manually-labeled training examples except for a single ambiguous word. The system uses a novel and efficient graph-based algorithm to cluster words into groups that have the same meaning. Our algorithm follows the principle of finding a maximum margin between clusters, determining a split of the data that maximizes the minimum distance between pairs of data points belonging to two different clusters.ResultsOn a test set of 21 ambiguous keywords from PubMed abstracts, our system has an average accuracy of 78%, outperforming a state-of-the-art unsupervised system by 2% and a baseline technique by 23%. On a standard data set from the National Library of Medicine, our system outperforms the baseline by 6% and comes within 5% of the accuracy of a supervised system.ConclusionOur system is a novel, state-of-the-art technique for efficiently finding word sense clusters, and does not require training data or human effort for each new word to be disambiguated.

Highlights

  • We aim to solve the problem of determining word senses for ambiguous biomedical terms with minimal human effort

  • They built probabilistic models using parallel corpus with an unsupervised approach. They demonstrated that the concept model improved performance on the word sense disambiguation task over the previous approaches participated in 21 Senseval-2 English All Words competition

  • We found that SENSATIONAL requires at least 30 documents per sense in order to determine accurate clusters

Read more

Summary

Introduction

We aim to solve the problem of determining word senses for ambiguous biomedical terms with minimal human effort. Supervised systems require multiple examples of each sense of a word in context, manually labeled with the correct sense. From this data, a supervised system can learn to predict the correct sense of the same word in a new context. A supervised system can learn to predict the correct sense of the same word in a new context These data sets are labor-intensive, time-consuming, and expensive to produce, and far have been produced for only several dozen ambiguous terms. It is impractical to scale this kind of technique to all terms used in the biomedical literature

Objectives
Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.