Abstract

Words in a natural language corpus follow a Zipfian distribution: a small minority of word types are frequent, while the vast majority are rare. Word-embedding techniques rely heavily on word frequencies in the corpus and cannot provide reliable representations for words that appear infrequently during training. To address this problem, we propose a novel algorithm that induces embeddings for rare words by leveraging morphological decomposition, stemming, and bidirectional translation. Compared with existing approaches, our algorithm maintains a relatively lightweight model while generating high-quality representations for a wider range of the vocabulary from the same corpus. We evaluated the algorithm on multiple general-domain and domain-specific datasets, and the experimental results show that it outperforms other state-of-the-art techniques.
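To illustrate the general idea of backing off from a rare surface form to morphological subunits, here is a minimal sketch. This is not the authors' algorithm: the toy embedding table, the tiny suffix list, and the simple averaging rule are all illustrative assumptions.

```python
# Illustrative sketch only: induce a vector for an out-of-vocabulary word
# by decomposing it into known pieces (stem, suffix) and averaging them.

KNOWN = {                 # toy embedding table; real vectors come from training
    "run": [1.0, 0.0],
    "walk": [0.0, 1.0],
    "ing": [0.5, 0.5],
}

SUFFIXES = ("ing", "ed", "s")   # hypothetical, tiny suffix list for the demo


def decompose(word):
    """Split a word into [stem, suffix] when the stem is in KNOWN."""
    for suf in SUFFIXES:
        stem = word[: -len(suf)]
        if word.endswith(suf) and stem in KNOWN:
            return [stem, suf] if suf in KNOWN else [stem]
    return [word]


def embed(word):
    """Return the word's vector, or the mean of its known pieces' vectors."""
    if word in KNOWN:
        return KNOWN[word]
    pieces = [p for p in decompose(word) if p in KNOWN]
    if not pieces:
        return None   # truly unseen; a real system would fall back further
    dims = len(next(iter(KNOWN.values())))
    return [sum(KNOWN[p][d] for p in pieces) / len(pieces) for d in range(dims)]
```

For example, "walking" is absent from the toy table, but it decomposes into "walk" and "ing", whose vectors are averaged to produce an induced representation.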
