Abstract

We present four types of neural language models trained on a large historical dataset of books in English, published between 1760 and 1900 and comprising ≈5.1 billion tokens. The language model architectures include word type embeddings (word2vec and fastText) and contextualized models (BERT and Flair). For each architecture, we trained a model instance using the whole dataset. Additionally, we trained separate instances on text published before 1850 for the type embeddings, and four instances considering different time slices for BERT. Our models have already been used in various downstream tasks, where they consistently improved performance. In this paper, we describe how the models were created and outline their reuse potential.

Highlights

  • As language is subject to continuous change, the computational analysis of digital heritage should attune models and methods to the specific historical contexts in which these texts were produced

  • This paper aims to facilitate the “historicization” of Natural Language Processing (NLP) methods by releasing various language models trained on a 19th-century book collection

  • To accommodate different research needs, we release a wide variety of models, from word type embeddings to more recent language models that produce context-dependent word or string embeddings (BERT and Flair, respectively)

Summary

OVERVIEW

As language is subject to continuous change, the computational analysis of digital heritage should attune models and methods to the specific historical contexts in which these texts were produced. This paper aims to facilitate the “historicization” of Natural Language Processing (NLP) methods by releasing various language models trained on a 19th-century book collection. These models can support research in digital and computational humanities, history, computational linguistics, and the cultural heritage or GLAM sector (galleries, libraries, archives, and museums). To accommodate different research needs, we release a wide variety of models, from word type embeddings (word2vec and fastText) to more recent language models that produce context-dependent word or string embeddings (BERT and Flair, respectively). Unlike type embeddings, which assign a single fixed vector to each word type, “contextual” models generate a distinct embedding for each token according to its textual context at inference time. The language models presented here have been used in several research projects: to assess the impact of optical character recognition (OCR) on NLP tasks (van Strien et al., 2020), to detect atypical animacy (Coll Ardanuy et al., 2020), and for targeted sense disambiguation (Beelen et al., 2021).
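
To illustrate the distinction in practice, the following is a minimal sketch, not the authors' released code: it extracts embeddings for the same word type in two different sentences using the Hugging Face transformers library, assuming one of the released BERT models has been downloaded locally. The identifier "path/to/bert_1760_1900", the example sentences, and the helper function token_embedding are placeholders introduced here for illustration only.

# A minimal sketch: context-dependent embeddings from a downloaded BERT model.
# "path/to/bert_1760_1900" is a placeholder, not an official model identifier.
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "path/to/bert_1760_1900"  # placeholder: substitute the downloaded model directory
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)
model.eval()

def token_embedding(sentence, target):
    """Return the last-hidden-state vector for the first subtoken of `target` in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # shape: (sequence_length, hidden_size)
    first_subtoken = tokenizer.tokenize(target)[0]
    target_id = tokenizer.convert_tokens_to_ids(first_subtoken)
    position = (enc["input_ids"][0] == target_id).nonzero()[0].item()
    return hidden[position]

# The same word type receives a different vector depending on its context.
v1 = token_embedding("The engine was driven by steam.", "engine")
v2 = token_embedding("She was the engine of the reform movement.", "engine")
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0).item())

A word type embedding such as word2vec would, by contrast, return a single fixed vector for “engine” regardless of the sentence in which it occurs.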
