Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study

Mark J Hill, Simon Hengchen

doi:10.1093/llc/fqz024

Open Access

Journal: Digital Scholarship in the Humanities
Publication Date: Apr 22, 2019
Citations: 57

Affiliation: University of Helsinki

#Eighteenth Century Collections Online #Optical Character Recognition

Abstract

This article aims to quantify the impact optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of specific analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes by offering some preliminary thoughts on how these conclusions can be applied to other datasets, by reflecting on the potential for predicting the quality of OCR where no ground-truth exists.
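
As an illustration of what comparing an OCR text with its keyed-in counterpart can involve, one widely used measure is the character error rate (CER): the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The sketch below is not taken from the article and should not be read as the authors' method; the function names and the sample strings (a long s misread as "f") are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Edit distance normalised by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)


# Hypothetical sample: two long-s characters read as "f" by the OCR engine.
ground_truth = "the pursuit of happiness"
ocr_text = "the purfuit of happinefs"
cer = character_error_rate(ocr_text, ground_truth)
print(f"CER = {cer:.3f}")
```

A measure like this needs a ground truth to compare against, which is why the abstract's closing question, how to predict OCR quality where none exists, is a separate problem.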
