Abstract

Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, in which each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and a word is represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly, and it lets us compute representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, on both word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
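
The following is a minimal sketch of the subword representation the abstract describes: a word's n-grams (with the boundary symbols "<" and ">" and n-gram lengths 3 to 6, as in the paper) are extracted, and the word vector is the sum of the n-gram vectors. The names `NGRAM_DIM`, `ngram_vectors`, and the plain dictionary lookup are illustrative assumptions; the actual implementation hashes n-grams into a fixed-size table.

```python
import numpy as np

NGRAM_DIM = 300  # embedding dimensionality (assumed for illustration)

def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> set[str]:
    """All character n-grams of the boundary-marked word, lengths
    n_min..n_max, plus the full marked word itself (3-6 are the
    paper's reported defaults)."""
    marked = f"<{word}>"
    grams = {marked[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(marked) - n + 1)}
    grams.add(marked)  # the word itself is also kept as one unit
    return grams

def word_vector(word: str, ngram_vectors: dict[str, np.ndarray]) -> np.ndarray:
    """Sum the vectors of the word's n-grams. Unseen n-grams simply
    contribute nothing, which is how words absent from the training
    data still receive a (possibly partial) representation."""
    vec = np.zeros(NGRAM_DIM)
    for g in char_ngrams(word):
        if g in ngram_vectors:
            vec += ngram_vectors[g]
    return vec
```

For example, `char_ngrams("where")` with n fixed at 3 yields `<wh`, `whe`, `her`, `ere`, `re>` plus the unit `<where>`, matching the example given in the paper.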

Highlights

  • Learning continuous representations of words has a long history in natural language processing (Rumelhart et al, 1988)

  • For English, we use the WS353 dataset introduced by Finkelstein et al (2001) and the rare word dataset (RW), introduced by Luong et al (2013)

  • We investigate a simple method to learn word representations by taking into account subword information


Summary

Introduction

Learning continuous representations of words has a long history in natural language processing (Rumelhart et al, 1988). Mikolov et al (2013b) proposed simple log-bilinear models to learn continuous representations of words on very large corpora efficiently. Most of these techniques represent each word of the vocabulary by a distinct vector, without parameter sharing. In French or Spanish, most verbs have more than forty different inflected forms, while the Finnish language has fifteen cases for nouns. These languages contain many word forms that occur rarely (or not at all) in the training corpus, making it difficult to learn good word representations. Because many word formations follow rules, it is possible to improve vector representations for morphologically rich languages by using character-level information, as the toy example below illustrates.
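
As a toy illustration of that parameter sharing (reusing the `char_ngrams` helper from the sketch after the abstract; the French example words are our own, not drawn from the paper), inflected forms of the same lemma share most of their character n-grams, so rare forms inherit information from frequent ones:

```python
# Two inflections of French "manger" (to eat): future and conditional.
a = char_ngrams("mangera")
b = char_ngrams("mangerait")
print(sorted(a & b))  # shared n-grams such as '<ma', 'mang', 'anger', ...
```

Even if one form never occurs in the corpus, summing these shared n-gram vectors still yields a meaningful representation for it.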

