Abstract

Word-based embedding approaches such as Word2Vec capture the meaning of words and the relations between them particularly well when trained on large text collections; however, they fail to do so with small datasets. Extensions such as fastText slightly reduce the amount of data needed, but the joint task of learning meaningful morphological, syntactic, and semantic representations still requires a lot of data. In this paper, we introduce a new approach to warm-start embedding models with morphological information in order to reduce training time and enhance their performance. We use word embeddings generated by both the Word2Vec and fastText models and enrich them with morphological information about words, derived from kernel principal component analysis (KPCA) of word similarity matrices. This can be seen as explicitly feeding the network morphological similarities and letting it learn semantic and syntactic similarities on its own. Evaluating our models on word similarity and analogy tasks in English and German, we find that they not only achieve higher accuracies than the original skip-gram and fastText models but also require significantly less training data and time. Another benefit of our approach is that it generates high-quality representations of infrequent words such as those found in very recent news articles with rapidly changing vocabularies. Lastly, we evaluate the different models on a downstream sentence classification task in which a CNN model is initialized with our embeddings and find promising results.
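
The following is a minimal sketch, not the authors' code, of the KPCA step described above. It assumes character n-gram Jaccard overlap as the string similarity function and scikit-learn's KernelPCA with a precomputed kernel; the toy vocabulary, the similarity choice, and the names ngram_similarity and morph_vectors are illustrative assumptions.

import numpy as np
from sklearn.decomposition import KernelPCA

vocab = ["run", "runs", "running", "walk", "walked", "walking"]  # toy vocabulary

def ngram_similarity(a, b, n=3):
    # Jaccard overlap of character n-grams; stands in for the paper's string similarity.
    grams = lambda w: {w[i:i + n] for i in range(max(len(w) - n + 1, 1))}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

# Word-by-word similarity matrix, treated as a precomputed kernel.
K = np.array([[ngram_similarity(w1, w2) for w2 in vocab] for w1 in vocab])

# Kernel PCA projects each word onto the leading components, yielding
# low-dimensional, purely morphological word vectors.
kpca = KernelPCA(n_components=4, kernel="precomputed")
morph_vectors = kpca.fit_transform(K)  # shape: (len(vocab), 4)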

Highlights

  • Continuous vector representations of words learned from unstructured text corpora are an effective way of capturing semantic relationships among words

  • We propose pre-training embeddings with a kernel principal component analysis (KPCA) computed on word similarity matrices, which are generated with a string similarity function over the words in a vocabulary, and injecting the pre-trained embeddings into Word2Vec and fastText by initializing those models with the KPCA word and subword embeddings (a sketch of this injection step follows this list)

  • To compare how well our models perform against the original skip-gram and fastText models, we use both as baselines in all our experiments and apply the same parameters and datasets when generating and evaluating embeddings for all models
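
As a hedged illustration of the injection step mentioned in the second highlight, the sketch below overwrites the randomly initialized input vectors of a gensim skip-gram model with the KPCA vectors from the earlier sketch before training. Using gensim is an assumption, not the authors' stated implementation, and vocab / morph_vectors refer to the toy objects defined above.

from gensim.models import Word2Vec

sentences = [["running", "is", "fun"], ["she", "walked", "home"]]  # toy corpus

model = Word2Vec(vector_size=4, sg=1, min_count=1, seed=1)  # skip-gram; dim matches KPCA components
model.build_vocab(sentences)

# Overwrite the random initial vectors with KPCA vectors where available;
# words missing from the KPCA vocabulary keep their random initialization.
kpca_lookup = dict(zip(vocab, morph_vectors))
for word, idx in model.wv.key_to_index.items():
    if word in kpca_lookup:
        model.wv.vectors[idx] = kpca_lookup[word]

model.train(sentences, total_examples=model.corpus_count, epochs=5)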


Summary

Introduction

Continuous vector representations of words learned from unstructured text corpora are an effective way of capturing semantic relationships among words. Approaches to computing word embeddings are typically based on the context of words, their morphemes, or corpus-wide co-occurrence statistics. As of this writing, arguably the most popular approaches are the Word2Vec skip-gram model (Mikolov et al., 2013a) and the fastText model (Bojanowski et al., 2017). The skip-gram model generates embeddings based on windowed word contexts; while it incorporates semantic information, it ignores word morphology. FastText improves the results by incorporating subword information, but it still fails in many cases. This is evident in the news domain, where new words such as names frequently appear over time, which in turn impacts the performance of downstream applications (a toy sketch of windowed contexts and subword n-grams follows the research questions below). The research questions we answer in this paper are: 1. Can high-quality word embeddings be trained on small datasets?

2. Can high-quality embeddings be generated for infrequent words?
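
The toy sketch below, which is illustrative only and not taken from the paper, contrasts the two mechanisms mentioned above: the (center, context) pairs a skip-gram model is trained on, and the boundary-marked character n-grams that fastText uses as subword information. The function names skipgram_pairs and char_ngrams are hypothetical.

def skipgram_pairs(tokens, window=2):
    # (center, context) pairs the skip-gram model is trained to predict.
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

def char_ngrams(word, n_min=3, n_max=6):
    # Boundary-marked character n-grams, the subword units used by fastText.
    marked = "<" + word + ">"
    return [marked[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

print(list(skipgram_pairs(["new", "player", "names", "appear", "daily"])))
print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>', '<whe', ...]
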
Related work
KPCA-based skip-gram and fastText models
Kernel PCA on string similarities
Models with KPCA embeddings
Experimental Results
20 Newsgroups, Text8
Baseline
Evaluation
Word Similarity Evaluation
Word Analogy Evaluation
20 Newsgroups, English Wiki 2016
Evaluation of performance on downstream applications
Future Work
