Abstract

We present a new dataset for the task of toponym resolution in digitized historical newspapers in English. It consists of 343 annotated articles from newspapers based in four different locations in England (Manchester, Ashton-under-Lyne, Poole and Dorchester), published between 1780 and 1870. The articles have been manually annotated with mentions of places, which are linked—whenever possible—to their corresponding entry on Wikipedia. The dataset consists of 3,364 annotated toponyms, of which 2,784 have been provided with a link to Wikipedia. The dataset is published in the British Library shared research repository, and is especially of interest to researchers working on improving semantic access to historical newspaper content.

Highlights

  • We present a new dataset for the task of toponym resolution in digitized historical newspapers in English

  • Some entity linking datasets have been created to address this issue, such as Ehrmann et al. (2020) and Hamdi et al. (2021), both built from digitized historical newspaper collections

  • This dataset comprises 343 articles carefully sampled from a variety of provincial nineteenth-century newspapers based in four different locations in England


Summary

OVERVIEW

We present a new dataset for the task of toponym resolution in digitized historical newspapers in English. The dataset has been created with the aim of becoming a benchmark for several tasks, including fuzzy string matching and toponym recognition and resolution, all of which contribute to the challenging pursuit of improving semantic access to OCRed historical texts in English. This dataset has been produced as part of Living with Machines, a multidisciplinary research project focused on the lived experience of industrialization in Britain during the long nineteenth century and, in particular, on the social and cultural impact of mechanization as reported in newspapers and other sources. Living with Machines is one of many projects that harness the growing volume of digitized newspaper collections for humanities research. A fraction of the annotated data has been used in previous Living with Machines studies, in particular Coll Ardanuy et al. (2019), and for fuzzy string matching in Hosseini, Nanni, and Coll Ardanuy (2020) and Coll Ardanuy et al. (2020).
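To make the fuzzy string matching task concrete, the following is a minimal sketch of how an OCR-damaged toponym might be matched against a list of candidate place names. It is illustrative only and is not the method used in the studies cited above: the toy gazetteer, the function names `similarity` and `best_match`, and the 0.8 threshold are all assumptions, and the similarity measure is the one provided by Python's standard `difflib` module.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(
    mention: str,
    gazetteer: Sequence[str],
    threshold: float = 0.8,  # assumed cut-off, chosen only for illustration
) -> Optional[str]:
    """Return the most similar gazetteer entry, or None if none is close enough."""
    scored = [(similarity(mention, name), name) for name in gazetteer]
    score, name = max(scored)
    return name if score >= threshold else None


# Toy gazetteer (assumption): the four newspaper locations named in the abstract.
GAZETTEER = ["Manchester", "Ashton-under-Lyne", "Poole", "Dorchester"]

if __name__ == "__main__":
    # "Manchefter" imitates a common OCR error (long s read as f).
    print(best_match("Manchefter", GAZETTEER))
    # A place name absent from the toy gazetteer should not be force-matched.
    print(best_match("London", GAZETTEER))
```

In a real toponym resolution setting the candidate list would come from a full gazetteer or knowledge base, and the matched candidate would still need to be disambiguated, for example between several places that share the same name.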

METHOD
DATASET DESCRIPTION
REUSE POTENTIAL
FUNDING STATEMENT