Abstract

This paper introduces a novel collection of word embeddings, numerical representations of lexical semantics, in 55 languages, trained on a large corpus of pseudo-conversational speech transcriptions from television shows and movies. The embeddings were trained on the OpenSubtitles corpus using the fastText implementation of the skipgram algorithm. Performance comparable with (and in some cases exceeding) that of embeddings trained on non-conversational (Wikipedia) text is reported on standard benchmark evaluation datasets. A novel evaluation method of particular relevance to psycholinguists is also introduced: prediction of experimental lexical norms in multiple languages. The models, as well as code for reproducing the models and all analyses reported in this paper (implemented as a user-friendly Python package), are freely available at: https://github.com/jvparidon/subs2vec.
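
For illustration, a minimal sketch of how embeddings of this kind can be trained with the fastText Python bindings is given below. The corpus path and hyperparameter values are placeholders for illustration only, not the exact settings used for the released models; the repository linked above contains the actual training code.

    import fasttext

    # Train skipgram embeddings on a plain-text corpus (one sentence per line).
    # The path and hyperparameters below are illustrative, not the paper's settings.
    model = fasttext.train_unsupervised(
        "opensubtitles.en.txt",   # placeholder path to a preprocessed subtitle corpus
        model="skipgram",
        dim=300,                  # dimensionality of the word vectors
        ws=5,                     # context window size
        minCount=5,               # ignore words occurring fewer than 5 times
        epoch=10,
    )

    # Inspect a vector and save the trained model.
    vector = model.get_word_vector("movie")
    model.save_model("subs.en.bin")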

Highlights

  • Recent progress in applied machine learning has resulted in new methods for efficient induction of high-quality numerical representations of lexical semantics—word vectors—directly from text

  • The results presented juxtapose three models trained by the authors using the same parametrization of the fastText skipgram algorithm: a wiki model trained on a corpus of Wikipedia articles, a subs model trained on the OpenSubtitles corpus, and a wiki+subs model trained on a combination of both corpora

  • Our aim in this study was to make available a collection of word embeddings trained on pseudo-conversational language in as many languages as possible using the same algorithm

Introduction

Recent progress in applied machine learning has resulted in new methods for efficient induction of high-quality numerical representations of lexical semantics—word vectors—directly from text. These word vectors have been used, for example, to predict patterns of neural activity (Pereira et al., 2018), to predict human lexical judgements of, e.g., word similarity, analogy, and concreteness (see Methods for more detail), and as models that help researchers gain quantitative traction on large-scale linguistic phenomena such as semantic typology (Garg et al., 2018), to give just a few examples. Progress in these areas is rapid, but constrained by the availability of high-quality training corpora and evaluation metrics in multiple languages. To meet this need for large, multilingual training corpora, word embeddings are often trained on Wikipedia, sometimes supplemented with other text scraped from web pages. This has the benefit that even obscure words and semantic relationships are often relatively well attested.
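
As a concrete illustration of the similarity-judgement evaluation mentioned above, the sketch below correlates cosine similarities from a set of trained vectors with human similarity ratings using Spearman's rank correlation. The file path and the word pairs with their ratings are hypothetical placeholders; the actual benchmark datasets and evaluation procedures are described in the Methods of the full paper and implemented in the linked repository.

    from gensim.models import KeyedVectors
    from scipy.stats import spearmanr

    # Load embeddings from a .vec file (word2vec text format, as produced by fastText).
    vectors = KeyedVectors.load_word2vec_format("subs.en.vec")  # placeholder path

    # Hypothetical word pairs with human similarity ratings (e.g., on a 0-10 scale).
    pairs = [("cup", "mug", 8.5), ("car", "bicycle", 5.0), ("king", "cabbage", 0.5)]

    # Cosine similarity between the two word vectors in each pair.
    model_sims = [vectors.similarity(w1, w2) for w1, w2, _ in pairs]
    human_sims = [rating for _, _, rating in pairs]

    # Rank correlation between model similarities and human judgements.
    rho, p_value = spearmanr(model_sims, human_sims)
    print(f"Spearman's rho = {rho:.2f}")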
