Abstract

Word-based embedding approaches such as Word2Vec capture the meaning of words and the relations between them particularly well when trained on large text collections; however, they fail to do so with small datasets. Extensions such as fastText slightly reduce the amount of data needed, but the joint task of learning meaningful morphological, syntactic, and semantic representations still requires a lot of data. In this paper, we introduce a new approach to warm-start embedding models with morphological information in order to reduce training time and enhance their performance. We use word embeddings generated by both the Word2Vec and fastText models and enrich them with morphological information about words, derived from kernel principal component analysis (KPCA) of word similarity matrices. This can be seen as explicitly feeding the network morphological similarities and letting it learn semantic and syntactic similarities on its own. Evaluating our models on word similarity and analogy tasks in English and German, we find that they not only achieve higher accuracies than the original skip-gram and fastText models but also require significantly less training data and time. Another benefit of our approach is that it generates high-quality representations of infrequent words such as those found in very recent news articles with rapidly changing vocabularies. Lastly, we evaluate the different models on a downstream sentence classification task in which a CNN model is initialized with our embeddings and find promising results.
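
The following is a minimal sketch, not the authors' code, of the KPCA step described above. It assumes character n-gram Jaccard overlap as the string similarity function and scikit-learn's KernelPCA with a precomputed kernel; the toy vocabulary, the similarity choice, and the names ngram_similarity and morph_vectors are illustrative assumptions.

import numpy as np
from sklearn.decomposition import KernelPCA

vocab = ["run", "runs", "running", "walk", "walked", "walking"]  # toy vocabulary

def ngram_similarity(a, b, n=3):
    # Jaccard overlap of character n-grams; stands in for the paper's string similarity.
    grams = lambda w: {w[i:i + n] for i in range(max(len(w) - n + 1, 1))}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

# Word-by-word similarity matrix, treated as a precomputed kernel.
K = np.array([[ngram_similarity(w1, w2) for w2 in vocab] for w1 in vocab])

# Kernel PCA projects each word onto the leading components, yielding
# low-dimensional, purely morphological word vectors.
kpca = KernelPCA(n_components=4, kernel="precomputed")
morph_vectors = kpca.fit_transform(K)  # shape: (len(vocab), 4)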

Highlights

  • Continuous vector representations of words learned from unstructured text corpora are an effective way of capturing semantic relationships among words

  • We propose pre-training embeddings with a kernel principal component analysis (KPCA) computed on word similarity matrices, which are generated with a string similarity function over the words in a vocabulary, and injecting the pre-trained embeddings into Word2Vec and fastText by initializing those models with the KPCA word and subword embeddings (a sketch of this injection step follows this list)

  • To compare how well our models perform against the original skip-gram and fastText models, we use both as baselines in all our experiments and apply the same parameters and datasets when generating and evaluating embeddings for all models
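
As a hedged illustration of the injection step mentioned in the second highlight, the sketch below overwrites the randomly initialized input vectors of a gensim skip-gram model with the KPCA vectors from the earlier sketch before training. Using gensim is an assumption, not the authors' stated implementation, and vocab / morph_vectors refer to the toy objects defined above.

from gensim.models import Word2Vec

sentences = [["running", "is", "fun"], ["she", "walked", "home"]]  # toy corpus

model = Word2Vec(vector_size=4, sg=1, min_count=1, seed=1)  # skip-gram; dim matches KPCA components
model.build_vocab(sentences)

# Overwrite the random initial vectors with KPCA vectors where available;
# words missing from the KPCA vocabulary keep their random initialization.
kpca_lookup = dict(zip(vocab, morph_vectors))
for word, idx in model.wv.key_to_index.items():
    if word in kpca_lookup:
        model.wv.vectors[idx] = kpca_lookup[word]

model.train(sentences, total_examples=model.corpus_count, epochs=5)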


Summary

Introduction

Continuous vector representations of words learned from unstructured text corpora are an effective way of capturing semantic relationships among words. Approaches to computing word embeddings are typically based on the context of words, their morphemes, or corpus-wide co-occurrence statistics. As of this writing, arguably the most popular approaches are the Word2Vec skip-gram model (Mikolov et al., 2013a) and the fastText model (Bojanowski et al., 2017). The skip-gram model generates embeddings based on windowed word contexts; while it incorporates semantic information, it ignores word morphology. FastText improves the results by incorporating subword information, but it still fails in many cases. This is evident in the news domain, where new words such as names frequently appear over time, which in turn impacts the performance of downstream applications (a toy sketch of windowed contexts and subword n-grams follows the research questions below). The research questions we answer in this paper are: 1. Can high-quality word embeddings be trained on small datasets?

2. Can high-quality embeddings be generated for infrequent words?
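
The toy sketch below, which is illustrative only and not taken from the paper, contrasts the two mechanisms mentioned above: the (center, context) pairs a skip-gram model is trained on, and the boundary-marked character n-grams that fastText uses as subword information. The function names skipgram_pairs and char_ngrams are hypothetical.

def skipgram_pairs(tokens, window=2):
    # (center, context) pairs the skip-gram model is trained to predict.
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

def char_ngrams(word, n_min=3, n_max=6):
    # Boundary-marked character n-grams, the subword units used by fastText.
    marked = "<" + word + ">"
    return [marked[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

print(list(skipgram_pairs(["new", "player", "names", "appear", "daily"])))
print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>', '<whe', ...]
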
Related work
KPCA-based skip-gram and fastText models
Kernel PCA on string similarities
Models with KPCA embeddings
Experimental Results
20 Newsgroups, Text8
Baseline
Evaluation
Word Similarity Evaluation
Word Analogy Evaluation
20 Newsgroups, English Wiki 2016
Evaluation of performance on downstream applications
Future Work
