The burden of legacy: Producing the Tagged Corpus of Early English Correspondence Extension (TCEECE)

Lassi Saario, Terttu Nevalainen, Samuli Kaislaniemi, Tanja Säily

doi:10.32714/ricl.09.01.07

Open Access

Abstract

This paper discusses the process of part-of-speech tagging the Corpus of Early English Correspondence Extension (CEECE), as well as the end result. The process involved normalisation of historical spelling variation, conversion from a legacy format into TEI-XML, and finally, tokenisation and tagging by the CLAWS software. At each stage, we had to face and work around problems such as whether to retain original spelling variants in corpus markup, how to implement overlapping hierarchies in XML, and how to calculate the accuracy of tagging in a way that acknowledges errors in tokenisation. The final tagged corpus is estimated to have an accuracy of 94.5 per cent (in the C7 tagset), which is circa two percentage points (pp) lower than that of present-day corpora but respectable for Late Modern English. The most accurate tag groups include pronouns and numerals, whereas adjectives and adverbs are among the least accurate. Normalisation increased the overall accuracy of tagging by circa 3.7pp. The combination of POS tagging and social metadata will make the corpus attractive to linguists interested in the interplay between language-internal and -external factors affecting variation and change.

Full Text