Abstract

Knowing the Most Frequent Sense (MFS) of a word has been shown to help Word Sense Disambiguation (WSD) models significantly. However, the scarcity of sense-annotated data makes it difficult to induce a reliable and high-coverage distribution of the meanings in a language vocabulary. To address this issue, in this paper we present CluBERT, an automatic and multilingual approach for inducing the distributions of word senses from a corpus of raw sentences. Our experiments show that CluBERT learns distributions over English senses that are of higher quality than those extracted by alternative approaches. When used to induce the MFS of a lemma, CluBERT attains state-of-the-art results on the English Word Sense Disambiguation tasks and helps to improve the disambiguation performance of two off-the-shelf WSD models. Moreover, our distributions also prove to be effective in other languages, beating all their alternatives for computing the MFS on the multilingual WSD tasks. We release our sense distributions in five different languages at https://github.com/SapienzaNLP/clubert.

Highlights

  • Word Sense Disambiguation (WSD) is the task of associating a word in context with a meaning from a given inventory of senses (Navigli, 2009)

  • We investigate the capabilities of CluBERT to scale over different languages by evaluating it on the multilingual Word Sense Disambiguation tasks of SemEval-2013 and SemEval-2015

  • We assess the effectiveness of the CluBERT Most Frequent Sense (MFS) when used as a backoff strategy in two off-the-shelf WSD approaches, i.e., UKB and the BiLSTM with attention model presented by Raganato et al. (2017b)

Summary

Introduction

Word Sense Disambiguation (WSD) is the task of associating a word in context with a meaning from a given inventory of senses (Navigli, 2009). Current approaches to WSD can mainly be divided into supervised and knowledge-based methods. While the former leverage manually-annotated data to train statistical models, the latter exploit the knowledge enclosed within a semantic network to identify the most appropriate meaning of a word in context. Since words and senses follow a Zipfian distribution (McCarthy et al., 2004a), information on rare words and meanings is scarce in both semantically-annotated data and knowledge bases. This undermines the ability of supervised and knowledge-based approaches to deal with words that are unseen at training time or that have only a few connections within a semantic network. The WordNet most frequent sense for the noun pipe is its smoking device meaning, although nowadays one would expect the metal pipe sense to appear more often in general
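To make the role of a sense distribution concrete, the following is a minimal sketch of how an MFS could be read off a per-lemma distribution and used as a backoff when a WSD model gives no answer. The sense labels, the probabilities, and the function names are all hypothetical and chosen for illustration only; they are not taken from CluBERT or its released data.

```python
from typing import Dict, Optional

# Hypothetical sense distributions: lemma -> {sense label -> probability}.
# Labels and numbers are illustrative assumptions, not CluBERT output.
SENSE_DISTRIBUTIONS: Dict[str, Dict[str, float]] = {
    "pipe": {"pipe%tube": 0.6, "pipe%smoking": 0.3, "pipe%music": 0.1},
    "bank": {"bank%finance": 0.7, "bank%river": 0.3},
}


def most_frequent_sense(
    lemma: str,
    distributions: Dict[str, Dict[str, float]] = SENSE_DISTRIBUTIONS,
) -> Optional[str]:
    """Return the highest-probability sense of a lemma, or None if the lemma is unknown."""
    senses = distributions.get(lemma)
    if not senses:
        return None
    return max(senses, key=senses.get)


def disambiguate_with_backoff(
    lemma: str,
    model_prediction: Optional[str],
    distributions: Dict[str, Dict[str, float]] = SENSE_DISTRIBUTIONS,
) -> Optional[str]:
    """Keep the model's prediction when it has one; otherwise back off to the MFS."""
    if model_prediction is not None:
        return model_prediction
    return most_frequent_sense(lemma, distributions)
```

In this sketch the backoff only fires when the model returns no prediction, which is one common way to combine an MFS with an existing system; the exact integration used with UKB and the BiLSTM model is not described in this section.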
