Abstract

Evaluating word relevancy is an important part of natural language processing. This paper takes word embeddings as its research object and uses machine learning to analyze Chinese word relevancy quantitatively. A word embedding represents a word as a vector of real numbers. Besides being low-dimensional and dense, a word embedding carries semantic information about the corresponding word in context. For example, the cosine value between two word embeddings reflects, to a certain extent, the semantic relevance of the corresponding words. However, by analyzing actual data, this paper finds that the mapping from cosine value to relevancy is not accurate enough. It therefore adopts a machine learning algorithm, the Gradient Boosting Decision Tree (GBDT). First, training samples are extracted from the word embedding library, and a training set is built by scoring each sample's relevancy to the target word. Second, a GBDT model is built on the training set to perform a regression from word embedding to relevancy score. Finally, the resulting model is used to fit relevancy scores for the whole word embedding library. Experimental results show that, compared with the cosine value of word embeddings, the GBDT regression model significantly raises the fitted values of words closely related in meaning to the target word and effectively suppresses the fitted values of words weakly related to it, which shows that the GBDT model is more accurate.
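
The two scoring approaches contrasted above can be sketched in a few lines. The following is a minimal illustration only, not the paper's implementation: the three-dimensional embeddings, the relevancy scores, and all function names are hypothetical, and the boosted trees are reduced to one-split stumps trained on squared error so that the example needs nothing beyond the standard library.

```python
import math

def cosine(u, v):
    """Cosine value between two equal-length, non-zero vectors (the baseline)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def fit_stump(X, residuals):
    """Pick the one-feature threshold split with the lowest squared error."""
    mean = sum(residuals) / len(residuals)
    best_err, best = float("inf"), (0, float("inf"), mean, mean)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if err < best_err:
                best_err, best = err, (j, t, lm, rm)
    return best  # (feature index, threshold, left value, right value)

def predict_stump(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value

def fit_gbdt(X, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump fits the residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps

def predict_gbdt(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)

# Hypothetical toy data: candidate embeddings and hand-assigned relevancy
# scores with respect to one target word (all values invented for illustration).
target = [1.0, 0.0, 0.5]
X = [[0.9, 0.1, 0.6], [0.8, 0.3, 0.4], [0.5, 0.5, 0.5],
     [0.1, 0.9, 0.2], [0.0, 1.0, 0.0], [0.3, 0.7, 0.1]]
y = [0.9, 0.7, 0.3, 0.1, 0.0, 0.1]

model = fit_gbdt(X, y)
baseline_scores = [cosine(target, row) for row in X]
gbdt_scores = [predict_gbdt(model, row) for row in X]
```

Here `baseline_scores` holds the cosine value for each candidate word and `gbdt_scores` holds the fitted relevancy from the boosted regressor. In practice the features, tree depth, number of rounds, and learning rate would be chosen on real embeddings and scored samples rather than on toy values like these.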
