Abstract
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, in which each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram; words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly, and it lets us compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, on both word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
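The bag-of-character-n-grams representation described above can be sketched as follows. The `char_ngrams` helper is a hypothetical illustration, not the paper's released code; boundary symbols `<` and `>` mark word beginnings and endings, so the trigram "her" extracted from "where" is kept distinct from the standalone word "her".

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Collect all character n-grams of a word, plus the word itself."""
    w = "<" + word + ">"  # boundary symbols distinguish prefixes and suffixes
    grams = {w}           # the full word is also kept as a special sequence
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    return grams

# Trigrams only, for the word "where":
print(sorted(g for g in char_ngrams("where", 3, 3) if g != "<where>"))
# → ['<wh', 'ere', 'her', 're>', 'whe']
```

In practice the paper extracts all n-grams for n between 3 and 6, which is what the default arguments above reflect.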
Highlights
Learning continuous representations of words has a long history in natural language processing (Rumelhart et al., 1988).
For English, we use the WS353 dataset introduced by Finkelstein et al. (2001) and the rare word (RW) dataset introduced by Luong et al. (2013).
We investigate a simple method to learn word representations by taking into account subword information.
Summary
Learning continuous representations of words has a long history in natural language processing (Rumelhart et al., 1988). Mikolov et al. (2013b) proposed simple log-bilinear models to learn continuous representations of words on very large corpora efficiently. Most of these techniques represent each word of the vocabulary by a distinct vector, without parameter sharing. In French or Spanish, most verbs have more than forty different inflected forms, while Finnish has fifteen cases for nouns. These languages contain many word forms that occur rarely (or not at all) in the training corpus, making it difficult to learn good word representations. Because many word formations follow rules, it is possible to improve vector representations for morphologically rich languages by using character-level information.
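Under this scheme, a word's vector is the sum of its character n-gram vectors, which is what makes rare and out-of-vocabulary words representable: they share n-grams with words seen in training. A minimal sketch, assuming a trained n-gram embedding table (initialized with random vectors here purely for illustration, not trained parameters):

```python
import numpy as np

DIM = 50  # embedding dimension, chosen arbitrarily for this sketch
rng = np.random.default_rng(seed=0)
ngram_vectors = {}  # hypothetical trained n-gram embedding table

def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of the word, plus the full word itself."""
    w = "<" + word + ">"  # boundary symbols distinguish prefixes and suffixes
    return {w} | {w[i:i + n]
                  for n in range(n_min, n_max + 1)
                  for i in range(len(w) - n + 1)}

def word_vector(word):
    # The word's vector is the sum of its n-gram vectors; unseen n-grams
    # receive a fresh random vector here, standing in for trained weights.
    grams = char_ngrams(word)
    return np.sum([ngram_vectors.setdefault(g, rng.normal(size=DIM))
                   for g in grams], axis=0)

# Even a word absent from "training" gets a vector, and related word
# forms share n-grams (and hence vector components):
v = word_vector("unhappiness")
shared = char_ngrams("unhappiness") & char_ngrams("unhappy")
print(v.shape, len(shared))
```

The `setdefault` lookup is only a stand-in for a real model, where the n-gram table is learned with the skipgram objective; the point is that the word vector is composed from subword units rather than stored per word.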
Published in: Transactions of the Association for Computational Linguistics