Abstract
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, in which each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram; words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly, and it lets us compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, on both word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
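The bag-of-character-n-grams representation described above can be sketched as follows. The `char_ngrams` helper is a hypothetical illustration, not the paper's released code; boundary symbols `<` and `>` mark word beginnings and endings, so the trigram "her" extracted from "where" is kept distinct from the standalone word "her".

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Collect all character n-grams of a word, plus the word itself."""
    w = "<" + word + ">"  # boundary symbols distinguish prefixes and suffixes
    grams = {w}           # the full word is also kept as a special sequence
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    return grams

# Trigrams only, for the word "where":
print(sorted(g for g in char_ngrams("where", 3, 3) if g != "<where>"))
# → ['<wh', 'ere', 'her', 're>', 'whe']
```

In practice the paper extracts all n-grams for n between 3 and 6, which is what the default arguments above reflect.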
Highlights
Learning continuous representations of words has a long history in natural language processing (Rumelhart et al., 1988).
For English, we use the WS353 dataset introduced by Finkelstein et al. (2001) and the rare word (RW) dataset introduced by Luong et al. (2013).
We investigate a simple method to learn word representations by taking into account subword information.
Summary
Learning continuous representations of words has a long history in natural language processing (Rumelhart et al., 1988). Mikolov et al. (2013b) proposed simple log-bilinear models to learn continuous representations of words on very large corpora efficiently. Most of these techniques represent each word of the vocabulary by a distinct vector, without parameter sharing. In French or Spanish, most verbs have more than forty different inflected forms, while Finnish has fifteen cases for nouns. These languages contain many word forms that occur rarely (or not at all) in the training corpus, making it difficult to learn good word representations. Because many word formations follow rules, it is possible to improve vector representations for morphologically rich languages by using character-level information.
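Under this scheme, a word's vector is the sum of its character n-gram vectors, which is what makes rare and out-of-vocabulary words representable: they share n-grams with words seen in training. A minimal sketch, assuming a trained n-gram embedding table (initialized with random vectors here purely for illustration, not trained parameters):

```python
import numpy as np

DIM = 50  # embedding dimension, chosen arbitrarily for this sketch
rng = np.random.default_rng(seed=0)
ngram_vectors = {}  # hypothetical trained n-gram embedding table

def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of the word, plus the full word itself."""
    w = "<" + word + ">"  # boundary symbols distinguish prefixes and suffixes
    return {w} | {w[i:i + n]
                  for n in range(n_min, n_max + 1)
                  for i in range(len(w) - n + 1)}

def word_vector(word):
    # The word's vector is the sum of its n-gram vectors; unseen n-grams
    # receive a fresh random vector here, standing in for trained weights.
    grams = char_ngrams(word)
    return np.sum([ngram_vectors.setdefault(g, rng.normal(size=DIM))
                   for g in grams], axis=0)

# Even a word absent from "training" gets a vector, and related word
# forms share n-grams (and hence vector components):
v = word_vector("unhappiness")
shared = char_ngrams("unhappiness") & char_ngrams("unhappy")
print(v.shape, len(shared))
```

The `setdefault` lookup is only a stand-in for a real model, where the n-gram table is learned with the skipgram objective; the point is that the word vector is composed from subword units rather than stored per word.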
Published in: Transactions of the Association for Computational Linguistics