Abstract

Compared with traditional methods, word embedding is an efficient language representation that can learn syntax and semantics using neural networks. As a result, a growing number of promising experiments in natural language processing (NLP) achieve state-of-the-art results by introducing word embeddings. In principle, embedding representation learning maps words into a low-dimensional vector space, and the resulting vectors can initialize NLP tasks such as text classification, sentiment analysis, and language understanding. However, polysemy is very common in many languages; it causes word ambiguity and in turn degrades system accuracy. Additionally, language models based on the distributional hypothesis mostly focus on word-level properties rather than morphology, which leads to uneven performance across different evaluations. At the same time, learning word embeddings and measuring their quality are two vital components of word representation. In this paper, we survey a range of language models, including single-sense and multi-sense word embeddings, and a range of evaluation approaches, including intrinsic and extrinsic evaluation. We find that there are clear gaps between vector-based similarities and manual annotations in word similarity evaluation, and that language models achieving good performance in intrinsic evaluations do not necessarily produce similar results in extrinsic evaluations. To the best of our knowledge, there is no universal language model or embedding learning method that suits most NLP tasks, and each evaluation method also has inherent defects compared to human knowledge. We also investigate the datasets used in intrinsic and extrinsic evaluations. We believe this overview will benefit the development of improved evaluation datasets and more rational evaluation methods.
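For concreteness, the sketch below illustrates the kind of intrinsic word-similarity evaluation referred to above: cosine similarities computed from an embedding table are compared against human-annotated similarity scores via Spearman rank correlation. The vectors, word pairs, and scores are hypothetical placeholders, not data from the surveyed models or benchmarks.

```python
# Minimal sketch of intrinsic word-similarity evaluation.
# Embeddings and human scores below are illustrative placeholders only.
import numpy as np
from scipy.stats import spearmanr

# Hypothetical pre-trained embeddings (word -> low-dimensional vector).
embeddings = {
    "car":   np.array([0.8, 0.1, 0.3]),
    "auto":  np.array([0.7, 0.2, 0.4]),
    "bank":  np.array([0.1, 0.9, 0.2]),
    "money": np.array([0.2, 0.8, 0.1]),
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Word pairs with hypothetical human-annotated similarity scores (0-10 scale).
pairs = [("car", "auto", 9.5), ("bank", "money", 8.0), ("car", "bank", 1.5)]

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in pairs]
human_scores = [h for _, _, h in pairs]

# Spearman rank correlation quantifies how well the model's similarity
# ranking agrees with manual annotations; 1.0 means perfect agreement.
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation with human judgments: {rho:.2f}")
```

A gap between the model's ranking and the human ranking shows up as a correlation well below 1.0, which is the kind of discrepancy between vectors and manual annotations noted in the abstract.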
