Abstract

Word Sense Disambiguation (WSD) is a key task of automatic semantic analysis that affects downstream tasks. Nevertheless, selecting the appropriate sense of an ambiguous word in context is a complicated task even for human native speakers, and it is harder still for automatic disambiguation models. We therefore need observations and heuristics that make the WSD task simpler or improve performance. Researchers have noticed that the distribution of ambiguous word senses follows certain regularities. In this paper we discuss three hypotheses about the distribution of word senses in a corpus: 1) Most Frequent Sense (MFS); 2) One Sense per Discourse; and 3) One Sense per Collocation. The following results were obtained on a corpus of Russian texts. The algorithm based on the Most Frequent Sense demonstrates relatively high precision on both the training and test sets (85.7% and 71.1%, respectively). The One Sense per Discourse hypothesis was confirmed in 93% of texts. The One Sense per Collocation hypothesis was confirmed for 84.46% of word pairs from the texts. The exceptions are related to difficulties and errors in manual word sense labeling. Heuristics based on the uneven distribution of ambiguous word senses in a corpus make the WSD task simpler and can be used when collecting training data sets for WSD models.
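To make the first two hypotheses concrete, the following is a minimal Python sketch, not the implementation used in the paper. It assumes the corpus is available as (lemma, sense) pairs; the function names, the transliterated toy lemmas and the sense labels are all hypothetical. The One Sense per Discourse rate is computed here over (text, lemma) pairs in which the lemma occurs at least twice, which is only one possible reading of how the hypothesis could be checked.

```python
from collections import Counter, defaultdict


def train_mfs(labeled):
    """Map each lemma to its most frequent sense in (lemma, sense) pairs."""
    counts = defaultdict(Counter)
    for lemma, sense in labeled:
        counts[lemma][sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


def mfs_precision(model, labeled):
    """Share of covered occurrences whose gold sense equals the MFS prediction.

    Occurrences of lemmas unseen in training are skipped (an assumption).
    """
    covered = [(lemma, sense) for lemma, sense in labeled if lemma in model]
    if not covered:
        return 0.0
    correct = sum(model[lemma] == sense for lemma, sense in covered)
    return correct / len(covered)


def one_sense_per_discourse_rate(docs):
    """Share of (text, lemma) pairs in which a repeated lemma keeps one sense."""
    consistent = 0
    total = 0
    for doc in docs:
        senses = defaultdict(list)
        for lemma, sense in doc:
            senses[lemma].append(sense)
        for occurrences in senses.values():
            if len(occurrences) < 2:
                continue  # a single occurrence cannot test the hypothesis
            total += 1
            consistent += len(set(occurrences)) == 1
    return consistent / total if total else 0.0


# Hypothetical toy data: transliterated lemmas with invented sense labels.
train = [("luk", "onion"), ("luk", "onion"), ("luk", "bow"), ("kosa", "braid")]
test = [("luk", "onion"), ("luk", "bow"), ("kosa", "braid"), ("zamok", "castle")]
docs = [
    [("luk", "onion"), ("luk", "onion")],
    [("luk", "onion"), ("luk", "bow"), ("kosa", "braid")],
]

model = train_mfs(train)
precision = mfs_precision(model, test)
discourse_rate = one_sense_per_discourse_rate(docs)
```

Because the MFS model needs only sense counts, it can serve as a cheap baseline, and a consistency check of this kind could also flag texts where manual labels deserve a second look.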
