Abstract

Eighteenth Century Collections Online (ECCO) is the most comprehensive dataset available in machine-readable form for eighteenth-century printed texts. It plays a crucial role in studies of eighteenth-century language and has vast potential for corpus linguistics. At the same time, it is an unbalanced corpus that poses a series of different problems. The aim of this paper is to offer a general overview of ECCO for corpus linguistics by analysing, for example, its publication countries and languages. We also analyse the role of the substantial number of reprints and new editions in the data, and discuss genres and estimates of Optical Character Recognition (OCR) quality. Our conclusion is that whereas ECCO provides a valuable source for corpus linguistics, scholars need to pay attention to historical source criticism. We highlight key aspects that need to be taken into account when considering its possible uses.

Highlights

  • The relevance of quantitative-statistical methods for the description of the variation of English has increased rapidly during the last decades

  • The more than 200,000 eighteenth-century documents included in Eighteenth Century Collections Online (ECCO) amount to a little over 50 per cent of what is included in the English Short-Title Catalogue (ESTC), the most comprehensive metadata collection of the British publication record for the early modern period (1470–1800)

  • A look at the geographical distribution of works in ECCO (Table 1) quickly shows that items printed in the US in particular are heavily underrepresented in the collection, most notably in comparison with Scotland and Ireland


Summary

INTRODUCTION

The relevance of quantitative-statistical methods for the description of the variation of English has increased rapidly during the last decades (cf. Gries 2012). There are good reasons to take ECCO as the basis of studies on language variation. It is the most comprehensive dataset available in machine-readable form for eighteenth-century printed texts. The need for large-scale harmonisation has been widely recognised, and various solutions that are relevant to corpus linguistics are already available or have been proposed for the processing of digitised texts and other data types (Mäkelä et al. 2020). According to Davies (2012: 172), the main problems with large text archives (such as ECCO) are “accuracy, annotation, architecture, availability, and genre balance between different time periods.” We will look at availability, architecture, genre balance and accuracy in terms of OCR quality. We weigh these aspects of ECCO and its use in corpus linguistics from different perspectives, especially with respect to the selection of corpora.

