Abstract

The performance of the traditional Word2Vec model depends heavily on the quality and quantity of the training corpus, which is at odds with how humans learn. To understand word meanings, humans follow a two-stage learning process: reading a linguist-compiled dictionary and doing reading comprehension, with the two stages complementing each other. Traditional Word2Vec is analogous to reading comprehension alone; the first stage, learning semantic rules from a language dictionary, such as thesaurus and etymology knowledge, is usually ignored by existing methods. In this work, we propose a robust word embedding learning framework that imitates the two-stage human learning process. In particular, we construct a semantic manifold based on the thesaurus and etymology to approximate the first stage, and then regularize the second stage (the Word2Vec model) with this semantic manifold. We train the proposed model on three corpora (Wikipedia, enwik9, and text8). The experimental results demonstrate that the proposed method learns much smoother vector representations, and its word embedding performance remains robust even when the method is trained on a very simple corpus.
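To make the idea of manifold regularization concrete, the sketch below shows one plausible realization, not necessarily the paper's exact formulation: a graph-smoothness penalty over thesaurus-derived synonym edges, added to the Word2Vec gradient update. All names here (`edges`, `lam`, the placeholder skip-gram gradient) are illustrative assumptions.

```python
import numpy as np

# Hypothetical thesaurus-derived synonym pairs (word-index edges); in the
# paper these would come from thesaurus and etymology resources.
edges = [(0, 1), (1, 2), (3, 4)]

vocab_size, dim = 5, 8
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(vocab_size, dim))  # word embedding matrix

def manifold_penalty(E, edges):
    """Graph-smoothness penalty: sum of squared distances over synonym edges.

    Equivalent to tr(E^T L E) for the graph Laplacian L of the edge set;
    minimizing it pulls thesaurus-related words toward each other.
    """
    return sum(np.sum((E[i] - E[j]) ** 2) for i, j in edges)

def manifold_grad(E, edges):
    """Gradient of the penalty with respect to the embedding matrix."""
    g = np.zeros_like(E)
    for i, j in edges:
        diff = 2.0 * (E[i] - E[j])
        g[i] += diff
        g[j] -= diff
    return g

# One gradient step combining a (placeholder) skip-gram gradient with the
# manifold term; lam trades corpus evidence against the semantic prior.
lam, lr = 0.1, 0.05
skipgram_grad = np.zeros_like(E)  # stand-in for the Word2Vec corpus gradient
E -= lr * (skipgram_grad + lam * manifold_grad(E, edges))
print(manifold_penalty(E, edges))
```

Under this assumed formulation, the regularizer supplies the "dictionary" signal even when the corpus is small or simple, which is consistent with the robustness claim in the abstract.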
