Abstract

In this paper we consider the task of extracting multiword expressions (MWE) for Russian thesaurus RuThes, which contains various types of phrases, including non-compositional phrases, multiword terms and their variants, light verb constructions, and others. We study several embedding-based features for phrases and their components and estimate their contribution to finding multiword expressions of different types comparing them with traditional association and context measures. We found that one of the distributional features has relatively high results of MWE extraction even when used alone. Different forms of its combination with other features (phrase frequency, association measures) improve both initial orderings. Besides, we demonstrate significant potential of an existing thesaurus for recognition of new multiword expressions for adding to the thesaurus.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call