Named entity recognition (NER) is the process of automatically identifying persons, places, organisations and other name-like entities in text. It supports natural language processing tasks such as automatic extraction of metadata from text and anonymisation/pseudonymisation of sensitive personal data, and it serves as a preprocessing step for linking different terms describing the same entity to a single reference. While NER is a mature language technology, it is generally lacking for historical language varieties. We describe our work on compiling SWENER-1800, a large (half a million words) reference corpus of historical Swedish texts covering the period from the first half of the 18th century until about 1900, and on manually annotating it with named entity types identified as significant for this period, as well as with sentence boundaries, which are notoriously difficult to recognise automatically in historical text. The corpus can then be used to train and evaluate NER systems and sentence segmenters for historical Swedish text. An additional concrete contribution of this work is a manual for the annotation of named entities in historical Swedish.