Abstract

We present a dataset of named entities in three languages: Medieval Latin, Middle High German and Old Norse. The dataset, which contains proper nouns for persons and places, was originally created to extract characters from three related medieval texts. Because the annotations cover low-resource pre-modern languages, they may be useful for building named-entity recognition tools for languages with little data and high linguistic variation.

Highlights

  • The annotations originate from a character-network analysis paper (Besnier, 2020)

  • We tokenized the texts with the CLTK package (Johnson et al., 2021) and selected the tokens that start with a capital letter, since capitalization often marks proper nouns

  • The annotations are lists of lemmata of proper nouns in three texts: the Decem Libri Historiarum by Gregory of Tours, written in Medieval Latin; the Völsunga saga, written in Old Norse (ON); and the Nibelungenlied, written in Middle High German (MHG)
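
The capital-letter filter described in the highlights can be illustrated with a minimal sketch. It is not the authors' pipeline: the dataset was tokenized with CLTK, whereas this example assumes a plain regular-expression tokenizer as a stand-in, and the function name `capitalized_candidates` and the sample sentence are hypothetical.

```python
import re

# Assumption: a simple word-character regex stands in for the CLTK
# tokenizer used to build the dataset; it is not the authors' code.
TOKEN_RE = re.compile(r"\w+")


def capitalized_candidates(text):
    """Return unique tokens that start with an uppercase letter,
    in order of first appearance."""
    seen = set()
    candidates = []
    for token in TOKEN_RE.findall(text):
        if token[0].isupper() and token not in seen:
            seen.add(token)
            candidates.append(token)
    return candidates


# Hypothetical Old Norse-style sample, written for illustration only.
sample = "Sigurðr reið til Brynhildar. Þá mælti Gunnarr."
print(capitalized_candidates(sample))
```

On this sample the filter keeps Sigurðr, Brynhildar and Gunnarr, but also the sentence-initial word Þá. This is why capitalization only often, not always, marks a proper noun, and why such candidates would still need checking before being recorded as entity lemmata.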

Summary

OVERVIEW

The aim was to compare how character sets evolve across time and space in stories with similar backgrounds.

