Towards a broad-coverage graphemic analysis of large historical corpora

Sandra Waldenberger, Ilka Lemke, Stefanie Dipper

doi:10.1515/zfs-2021-2037

Open Access

Abstract

This paper presents a method which we are developing to explore graphemic variation in large historical corpora of German. Historical corpora provide an amount of data at the level of graphemics which cannot be handled exhaustively using common methods of manual evaluation. To deal with this challenge, we apply methods from computational linguistics to pave the way for a broad-coverage graph(em)ic analysis of large historical corpora. In this paper, we show how our approach can be applied to the Reference Corpus of Middle High German. Illustrating our method and linguistic analysis, we present findings from our investigations into diatopic and/or diachronic variation as documented in 13th and 14th century charters (Urkunden) from the corpus.

Highlights

The methods we present in this paper answer the call for semi-automatic means to analyze graphemic variation in historical texts.
The graphemic level provides data sets that consist, on a basic level, of nothing other than character strings, which can be processed automatically. We use this fact to our advantage: The computational linguistic methods that we use are based on methods developed for normalizing historical spellings, i.e., for automatically mapping a historical spelling variant to a standardized form.
On the level of historical graphemics, our goal is to map out in detail the continuum of different ‘levels’ of variation.

Summary

Introduction

The methods we present in this paper answer the call for semi-automatic means to analyze graphemic variation in historical texts (cf. Elmentaler 2018: 335). For example, word-initial ko- in one variety might correspond to cho- in the other variety (as in chomen vs. komen ‘come’). Such mappings form the basis for our graphemic investigations. In Dipper and Waldenberger (2017), we applied this methodology for the first time and examined mappings that were derived from a parallel corpus containing texts of different dialects from Early New High German, with large overlaps in vocabulary. The results of this pilot study were promising in that relevant variants could be automatically identified.
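To make the idea of a mapping between spelling variants concrete, the following is a minimal sketch and not the authors' implementation. It assumes that equivalent word forms from two varieties have already been paired, and it uses Python's standard `difflib.SequenceMatcher` as a stand-in for whatever alignment technique the paper actually employs. The function names and the word pairs other than chomen/komen are hypothetical and chosen only for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def spelling_differences(form_a: str, form_b: str) -> list[tuple[str, str, str]]:
    """Return (position, segment_a, segment_b) for each span where the forms differ."""
    matcher = SequenceMatcher(None, form_a, form_b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical stretches carry no variation
        if i1 == 0 and j1 == 0:
            position = "initial"
        elif i2 == len(form_a) and j2 == len(form_b):
            position = "final"
        else:
            position = "medial"
        diffs.append((position, form_a[i1:i2], form_b[j1:j2]))
    return diffs


def count_differences(pairs: list[tuple[str, str]]) -> Counter:
    """Count how often each differing segment pair occurs across all word pairs."""
    counts: Counter = Counter()
    for form_a, form_b in pairs:
        counts.update(spelling_differences(form_a, form_b))
    return counts


# Only komen/chomen comes from the text; the other pairs are invented examples.
EXAMPLE_PAIRS = [("komen", "chomen"), ("korn", "chorn"), ("man", "man")]

if __name__ == "__main__":
    for rule, freq in count_differences(EXAMPLE_PAIRS).most_common():
        print(rule, freq)
```

In this sketch the pair komen/chomen yields a word-initial correspondence between k and ch, and counting such correspondences over many word pairs gives a rough frequency picture of how two texts differ in spelling. A character-level diff is a deliberately simple choice here; it has no notion of graphemes spanning several characters beyond what the alignment happens to group together.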

Generating difference profiles

Interpreting difference profiles

Text pairings reflecting diatopic variation

Text pairings reflecting diachronic variation

Statistically determined graphemic similarities

Conclusion

Journal: Zeitschrift für Sprachwissenschaft	Publication Date: Nov 25, 2021
Citations: 1	License type: CC BY 4.0
