Abstract
In natural language, accurately inferring the meaning of a token from its context is crucial to understanding complex expressions, yet this remains difficult for machines. Traditional language models used to train distributed word vectors are often restricted to a single embedding per word, regardless of how many senses the word has. In this paper, we develop a model called MSCvec (Multi-sense Soft Clustering Vector) for word sense disambiguation of polysemous words in context. We extract the features of individual words from a co-occurrence PPMI (Positive Pointwise Mutual Information) matrix and decompose this matrix by NMF (Nonnegative Matrix Factorization) into low-rank representations of the target words, which serve as input to an unsupervised sparse soft clustering method called Sparse Fuzzy C-means (SFCM). We use SFCM to determine the global semantic space of words and to partition the subspaces corresponding to the multiple senses of a polysemous word. We relabel candidate words by the negative average log-likelihood and train multi-sense embeddings with the extended vocabulary using the fastText model. Compared with traditional static embeddings, the results show that the NMF and SFCM design improves performance on word similarity and relatedness tasks as well as on text classification tasks across different types of text. The accurate semantic representation that MSCvec provides appears to be necessary for these results.
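The feature-extraction and clustering steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy count matrix, the function names (`ppmi`, `nmf`, `fuzzy_c_means`), and all hyperparameters are assumptions made for illustration. NMF is done here with standard multiplicative updates, and plain fuzzy c-means stands in for SFCM, so the sparsity mechanism, the log-likelihood relabeling, and the fastText training are all omitted.

```python
import numpy as np


def ppmi(counts):
    """Positive PMI of a word-by-context co-occurrence count matrix."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    expected = row @ col / total  # counts expected under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts / expected)
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts carry no association
    return np.maximum(pmi, 0.0)   # clip negative associations to zero


def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor V ~ W @ H with multiplicative updates (Frobenius loss)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, seed=0, eps=1e-9):
    """Plain fuzzy c-means (a stand-in for SFCM, without sparsity)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)  # memberships of each row sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
        inv = dist ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, centers


# Hypothetical toy data: 6 target words (rows) by 5 context words (columns).
counts = np.array([
    [8, 6, 0, 1, 0],
    [7, 9, 1, 0, 0],
    [0, 1, 9, 7, 2],
    [1, 0, 8, 9, 1],
    [4, 5, 4, 5, 0],  # a word that co-occurs with both context groups
    [0, 0, 1, 2, 9],
])

P = ppmi(counts)                             # word features
W, H = nmf(P, rank=3)                        # low-rank word representations in W
U, centers = fuzzy_c_means(W, n_clusters=2)  # soft cluster memberships
```

Each row of `U` holds the graded membership of one word in each cluster. In a soft-clustering approach of this kind, a word with substantial membership in more than one cluster is a candidate for having multiple senses.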
More From: International Journal of Pattern Recognition and Artificial Intelligence