Abstract

Since its inception, the Word2vec vector space has become a universal tool for both scientific and practical applications. Over time, it became clear that methods for interpreting the location of words in vector spaces were lacking. Existing methods relied on analogies or on clustering of the vector space. In recent years, an approach based on probing (analysis of the impact of small changes in the model on the result) has been actively developed. In this paper, we propose a new method for interpreting the arrangement of words in a vector space, applicable to high-level interpretation of the space as a whole. The method identifies the main directions that select large groups of words (about a third of all the words in the model's dictionary) and oppose them according to some semantic features. The method allows us to build a shallow hierarchy of such features. We conducted our experiments on three models trained on different corpora: the Russian National Corpus, Araneum Russicum, and a collection of scientific articles from different subject domains. For our experiments, we used only nouns from the models' dictionaries. The article considers an expert interpretation of this division up to the third level. The set of selected features and their hierarchy differ from model to model, but they have a lot in common. We found that the identified semantic features depend on the texts comprising the training corpus, their subject domain, and their style. The resulting division of words does not always correlate with the common sense used for ontology development. For example, one of the coinciding features is the abstract or material nature of the object. However, at the upper level of the models, words are divided into everyday/special lexis, archaic lexis, proper names and common nouns. The article provides examples of words included in the derived groups.
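The idea of a direction that selects about a third of the vocabulary and opposes it to another group can be illustrated with a minimal sketch. The abstract does not state how the main directions are computed, so the sketch below substitutes an assumption: it takes the first principal direction of the centered word vectors (via SVD) and reads off the words with the lowest and highest projections. The function name `split_by_main_direction`, the `fraction` parameter, and the toy words and vectors are all hypothetical and are not taken from the paper.

```python
import numpy as np

def split_by_main_direction(words, vectors, fraction=1 / 3):
    """Split words into two opposed groups along one dominant direction.

    Assumption: the "main direction" is approximated by the first
    principal direction of the centered vectors. The paper may use
    a different procedure.
    """
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)  # center the vectors
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    direction = vt[0]
    scores = X @ direction  # projection of each word onto the direction
    order = np.argsort(scores)
    k = max(1, int(round(len(words) * fraction)))
    low = [words[i] for i in order[:k]]    # one pole of the direction
    high = [words[i] for i in order[-k:]]  # the opposite pole
    return direction, low, high

# Toy data (hypothetical): three "material" and three "abstract" nouns.
words = ["table", "stone", "bread", "idea", "freedom", "theory"]
vectors = np.array([
    [1.0, 0.1, 0.0],
    [0.9, -0.1, 0.1],
    [1.1, 0.0, -0.1],
    [-1.0, 0.1, 0.0],
    [-0.9, 0.0, 0.1],
    [-1.1, -0.1, -0.1],
])
direction, low, high = split_by_main_direction(words, vectors)
print("one pole:", low)
print("opposite pole:", high)
```

The sign of a principal direction is arbitrary, so which group appears as `low` and which as `high` carries no meaning by itself; only the opposition between the two groups does. Applying the same function again to each resulting group would be one conceivable way to obtain a shallow hierarchy, although whether the authors proceed this way is not stated in the abstract.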
