Named-Entity Dataset for Medieval Latin, Middle High German and Old Norse

Clément Besnier, William Mattingly

doi:10.5334/johd.36

Abstract

We present a dataset of named entities in three languages: Medieval Latin, Middle High German and Old Norse. The dataset, containing proper nouns denoting persons and places, was originally created to extract characters from three related medieval texts. Because the annotations cover low-resource pre-modern languages, they may be useful for building named-entity recognition tools for languages with little data and high linguistic variation.

Highlights

The annotations were originally produced for a character-network analysis paper (Besnier, 2020)
We tokenized the texts with the CLTK package (Johnson et al., 2021) and selected the tokens that start with a capital letter, since capitalization often marks proper nouns
The annotations are lists of lemmata of proper nouns in three texts: the Decem Libri Historiarum by Gregory of Tours written in Medieval Latin, the Völsunga saga written in Old Norse (ON), and the Nibelungenlied written in Middle High German (MHG)
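
The capital-letter selection described in the highlights can be illustrated with a minimal sketch. It is not the authors' pipeline: a plain regular-expression tokenizer stands in for the CLTK tokenizer, the function name `capitalized_candidates` is hypothetical, and the sample sentence is invented for illustration rather than quoted from any of the three texts.

```python
import re


def capitalized_candidates(text):
    """Return the unique capitalized tokens of `text`, in order of first appearance."""
    # Stand-in tokenizer (assumption): runs of Unicode letters.
    # The dataset itself was tokenized with CLTK, not with this regex.
    tokens = re.findall(r"[^\W\d_]+", text)
    candidates = []
    for token in tokens:
        # Keep a token only if its first letter is uppercase and it is new.
        if token[0].isupper() and token not in candidates:
            candidates.append(token)
    return candidates


# Invented sample sentence, for illustration only.
sample = "Gunther und Hagen riten ze Wormez. Do sprach Kriemhilt"
print(capitalized_candidates(sample))
```

Sentence-initial words such as "Do" in the sample are also capitalized, so a filter of this kind yields candidates rather than a finished list of proper nouns, and the candidates still need to be reviewed.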