Abstract

In natural language, accurately inferring the meaning of a token from its context is crucial to understanding a sophisticated expression. However, this is not easy for a machine. The traditional language models used to train distributed word vectors are often restricted to a single-sense embedding per word. In this paper, we develop a model called MSCvec (Multi-sense Soft Clustering Vector) for word sense disambiguation of polysemous words in context. We extract the features of individual words from a co-occurrence PPMI (Positive Pointwise Mutual Information) matrix and decompose this matrix by NMF (Nonnegative Matrix Factorization) into low-rank representations of the target words, which serve as the input to an unsupervised sparse soft clustering method called Sparse Fuzzy C-means (SFCM). We use SFCM to determine the global semantic space of words and to partition it into subspaces for the multiple senses of a polysemous word. We then relabel candidate words by the negative average log likelihood and train multi-sense embeddings over the extended vocabulary with the fastText model. Compared with traditional static embeddings, the results show that the NMF and SFCM design improves performance on word similarity and relatedness tasks, as well as on text classification tasks across different types of text. The accurate semantic representation provided by MSCvec appears necessary for producing these strong results.
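The sketch below illustrates the pipeline summarized above on a toy corpus: building a PPMI co-occurrence matrix, factorizing it with NMF to obtain low-rank word representations, and soft-clustering those representations to obtain per-word sense memberships. It is a minimal illustration rather than the paper's implementation: the corpus, window size, component and cluster counts are arbitrary, and a plain fuzzy c-means update stands in for the sparse SFCM variant.

```python
# Minimal sketch of the PPMI -> NMF -> soft-clustering pipeline (illustrative only).
from collections import Counter
import numpy as np
from sklearn.decomposition import NMF

corpus = [
    "the bank raised interest rates",
    "she sat on the river bank",
    "the bank approved the loan",
    "fish swim near the river bank",
]
window = 2  # symmetric co-occurrence window (assumed value)

# Word-word co-occurrence counts within the window.
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[idx[w], idx[sent[j]]] += 1

# PPMI(w, c) = max(0, log(p(w, c) / (p(w) p(c)))).
total = counts.sum()
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((counts / total) / (p_w * p_c))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Low-rank word representations via NMF: PPMI ~ W @ H, rows of W are word vectors.
W = NMF(n_components=4, init="nndsvda", max_iter=500).fit_transform(ppmi)

# Plain fuzzy c-means memberships over the word vectors (stand-in for SFCM).
def fuzzy_cmeans(X, k=3, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], k))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (dist ** (2.0 / (m - 1.0)))
        U /= U.sum(axis=1, keepdims=True)
    return U

memberships = fuzzy_cmeans(W)
print("soft cluster memberships for 'bank':", np.round(memberships[idx["bank"]], 3))
```

A word whose membership mass is spread over several clusters would be a candidate polysemous word; in the full method, its occurrences are relabeled per sense before training the fastText embeddings.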
