Abstract

Word embeddings trained on natural corpora (e.g., newspaper collections, Wikipedia or the Web) excel at capturing thematic similarity ("topical relatedness") for word pairs such as 'coffee' and 'cup' or 'bus' and 'road'. However, they are less successful on pairs exhibiting taxonomic similarity, such as 'cup' and 'mug' (near synonyms) or 'bus' and 'train' (types of public transport). Conversely, purely taxonomy-based embeddings (e.g. those trained on a random walk over WordNet's structure) outperform natural-corpus embeddings on taxonomic similarity but underperform them on thematic similarity. Previous work suggests that performance gains in both types of similarity can be achieved by enriching natural-corpus embeddings with taxonomic information from resources such as WordNet, for instance by combining natural-corpus embeddings with such random-walk taxonomic embeddings. This paper conducts a deep analysis of this assumption and shows that both the size of the natural corpus and the random-walk coverage of the WordNet structure play a crucial role in the performance of the combined (enriched) vectors on both similarity tasks. Specifically, we show that embeddings trained on medium-sized natural corpora benefit the most from taxonomic enrichment, whilst embeddings trained on large natural corpora only benefit from this enrichment when evaluated on taxonomic similarity tasks. The implication is that care must be taken in controlling both the size of the natural corpus and the size of the random walk used to train the vectors. In addition, we find that, although the WordNet structure is finite and can be fully traversed in a single pass, the repetition of well-connected WordNet concepts in extended random walks effectively reinforces taxonomic relations in the learned embeddings.
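
For readers who want a concrete picture of the random-walk pseudo-corpus idea mentioned above, the following Python sketch generates pseudo-sentences by walking WordNet's hypernym/hyponym links (via NLTK) and trains skip-gram embeddings on them (via gensim). The walk length, restart policy, choice of traversed relations and hyperparameters are illustrative assumptions, not the exact procedure used in the paper.

```python
import random
from nltk.corpus import wordnet as wn
from gensim.models import Word2Vec

def wordnet_random_walk(num_walks=100_000, walk_length=20, seed=0):
    """Generate pseudo-sentences by randomly walking WordNet's
    hypernym/hyponym structure (illustrative sketch only)."""
    rng = random.Random(seed)
    synsets = list(wn.all_synsets())
    walks = []
    for _ in range(num_walks):
        node = rng.choice(synsets)
        sentence = []
        for _ in range(walk_length):
            # Emit one lemma of the current synset as a pseudo-token.
            sentence.append(rng.choice(node.lemma_names()).lower())
            # Move to a random taxonomic neighbour; stop if the node is isolated.
            neighbours = node.hypernyms() + node.hyponyms()
            if not neighbours:
                break
            node = rng.choice(neighbours)
        walks.append(sentence)
    return walks

# Train taxonomic embeddings on the pseudo-corpus (skip-gram; 300 dims assumed).
walks = wordnet_random_walk()
taxo_model = Word2Vec(sentences=walks, vector_size=300, window=5, min_count=5, sg=1)
```

The size of the generated pseudo-corpus (number and length of walks) is the "random-walk coverage" knob discussed in the abstract; the resulting vectors in `taxo_model.wv` can then be combined with vectors trained on a natural corpus of a chosen size.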

Highlights

  • Word embeddings are vectors that capture the distributional semantic information of words in the corpora on which they are trained [1, 2]

  • We show that embeddings trained on medium-sized natural corpora benefit the most from taxonomic enrichment whilst embeddings trained on large natural corpora only benefit from this enrichment when evaluated on taxonomic similarity tasks

  • The size of the generated WordNet random-walk pseudo-corpus (measured in millions of tokens) plays a crucial role in the performance of the combined (enriched) embeddings


Introduction

Word embeddings are vectors that capture the distributional semantic information of words in the corpora on which they are trained [1, 2]. They have been shown to perform well on thematic similarity benchmarks [3], but have been less successful on stricter taxonomic and synonymy benchmarks [4, 5]. There have been recent efforts to incorporate explicit taxonomic information from lexical taxonomies, such as WordNet [6], into word embeddings [7, 8]. By contrast, other work has sought to construct true distributional embeddings from lexical taxonomies by traversing them in a random-walk fashion [10]. These random-walk taxonomic embeddings outperform natural-corpus embeddings on strict taxonomic similarity benchmarks such as SimLex-999 [4], a gold standard for taxonomic similarity.
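
To make the comparison concrete, a minimal evaluation sketch is shown below. It assumes two pre-trained vector lookups (one from a natural corpus, one from a WordNet random walk), combines them by concatenating L2-normalised vectors (concatenation is only one possible combination strategy; the text above does not specify which is used), and scores the result against a benchmark such as SimLex-999 with Spearman correlation.

```python
import numpy as np
from scipy.stats import spearmanr

def combine(natural_vecs, taxo_vecs, word):
    """Concatenate L2-normalised natural-corpus and taxonomic vectors for one word.
    Concatenation is an assumed combination strategy, used here only for illustration."""
    v_nat = np.asarray(natural_vecs[word], dtype=float)
    v_tax = np.asarray(taxo_vecs[word], dtype=float)
    v_nat /= np.linalg.norm(v_nat)
    v_tax /= np.linalg.norm(v_tax)
    return np.concatenate([v_nat, v_tax])

def spearman_on_benchmark(get_vector, pairs):
    """Spearman correlation between model cosine similarities and gold scores.
    `pairs` is assumed to be a list of (word1, word2, gold_score) tuples,
    e.g. parsed from the SimLex-999 distribution file."""
    model_sims, gold_scores = [], []
    for w1, w2, gold in pairs:
        try:
            v1, v2 = get_vector(w1), get_vector(w2)
        except KeyError:
            continue  # skip out-of-vocabulary pairs
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        model_sims.append(cos)
        gold_scores.append(gold)
    return spearmanr(model_sims, gold_scores).correlation
```

A hypothetical call such as `spearman_on_benchmark(lambda w: combine(nat_vecs, taxo_vecs, w), simlex_pairs)` would then give the correlation for the combined vectors, which can be compared against either component on its own.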
