Abstract
The use of some basic computer science concepts could expand the possibilities of (manual) graphematic text corpus analysis. With these concepts it can be shown that graphematic variation decreases steadily in printed German texts from 1600 to 1900. While variability is consistently lower on the text-internal level, it decreases faster for the whole available writing system of individual decades. But which changes took place exactly? Which types of variation disappeared more quickly, and which ones persisted? How do we deal with amounts of data too large to be processed manually? Which aspects are of special importance, or get lost, when working with a large textual base? A measure called entropy quantifies the variability of the spellings of a given word form, lemma, text or subcorpus, with few restrictions but also with less detail in the results. The difference between two spellings can be measured via the Damerau-Levenshtein distance. To a certain degree, automated data handling can also determine the exact changes that took place. These differences can then be counted and ranked. The data source is the German Text Archive of the Berlin-Brandenburg Academy of Sciences and Humanities. It offers, for example, orthographic normalization (which is extremely useful), part-of-speech tagging and lemmatization. In contrast to many other approaches, the establishment of today's standardized spellings is not seen as the goal of the developments and is therefore not the focus of the research. Instead, the differences between individual spellings are of interest. In a subsequent step, the intra- and extralinguistic factors that caused these developments should be determined. These methodological findings could then be used to improve research methods in other graphematic fields of interest, e. g. computer-mediated communication.
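The two measures named in the abstract can be illustrated with a minimal sketch. It assumes that "entropy" means Shannon entropy over the relative frequencies of the attested spellings, and it uses the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance. The abstract does not specify either choice, and the function names and example spellings are hypothetical.

```python
import math
from collections import Counter


def spelling_entropy(spellings):
    """Shannon entropy (in bits) of a list of attested spellings.

    Assumption: variability is measured over relative frequencies of
    spelling variants; 0.0 means no variation at all.
    """
    counts = Counter(spellings)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions and transpositions of
    two adjacent characters, each with cost 1.
    """
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


# Hypothetical example: two equally frequent variants give 1 bit of entropy.
print(spelling_entropy(["theil", "teil"]))
print(osa_distance("theil", "teil"))
```

A lemma with a single attested spelling yields an entropy of zero, and the value grows as variants become more numerous and more evenly distributed.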
Highlights
Research in historical graphematics has come a long way over the last decades, even more so given the developments in data digitization
All the spelling pairs affected by a certain difference are considered collectively. This means that certain variations (e. g. of two graphemes) are observed across the whole data set, which allows for a generalization across specific word forms or spellings
This article has introduced a data-driven approach to the investigation of spelling variation
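The idea of considering all spelling pairs affected by a certain difference collectively can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's `difflib.SequenceMatcher` to extract character-level differences, which is not necessarily the alignment procedure used in the article, and the function names and spelling pairs are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def spelling_differences(old, new):
    """Return the character-level differences between two spellings
    as (old_segment, new_segment) tuples; "" marks an insertion or deletion."""
    matcher = SequenceMatcher(None, old, new)
    return [
        (old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def rank_differences(pairs):
    """Count each difference type across all spelling pairs and rank them
    from most to least frequent."""
    counts = Counter()
    for old, new in pairs:
        counts.update(spelling_differences(old, new))
    return counts.most_common()


# Hypothetical (historical spelling, normalized spelling) pairs.
pairs = [("theil", "teil"), ("thun", "tun"), ("vnd", "und")]
for difference, frequency in rank_differences(pairs):
    print(difference, frequency)
```

Because the differences are counted independently of the word forms they occur in, a pattern such as the loss of `h` after `t` is ranked by its frequency across the whole data set rather than per word.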
Summary
Research in historical graphematics has come a long way over the last decades, even more so given the developments in data digitization. Important research has already been conducted on specific phenomena like capitalization (e. g. Bergmann and Nerius 1997; Dücker et al. 2020) or the establishment of morphological spellings and stem constancy (e. g. Ruge 2004; Voeste 2008a). There are also more comprehensive analyses, which often investigate only a very narrow time span and/or are restricted to very specific authors, areas, or genres (e. g. Moser 1977; Glaser 1985; Koller 1989, to name just a few).