Abstract
Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n = 322) and Arabic-language article scans (n = 100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) performed substantially better than Tesseract, especially on noisy documents. Accuracy for English was considerably higher than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available “Noisy OCR Dataset” (NOD) for reuse in future benchmarking studies.
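For readers who want to reproduce this kind of comparison, OCR accuracy is commonly scored by aligning the recognized text against a ground-truth transcription, for instance with a character error rate (CER). The sketch below is a minimal, self-contained Python illustration of that metric; it is not the study's own evaluation code, and the function and example strings are placeholders of our own.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein edit distance between the ground truth and the OCR output,
    normalized by the reference length (0.0 = perfect recognition)."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

# Example: a noisy OCR output of a short ground-truth string.
truth = "Optical Character Recognition"
ocr_output = "0ptical Charaoter Recognltion"
print(f"CER: {character_error_rate(truth, ocr_output):.3f}")  # ~0.103
```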
Highlights
Few technologies hold as much promise for the social sciences and humanities as optical character recognition (OCR)
Pre-trained, general Optical Character Recognition (OCR) processors have a much higher potential for wide adoption in the scholarly community, and their out-of-the-box performance is of scientific interest
General OCR processors have struggled with non-Western languages ([16]), rendering them less useful for the many scholars working on documents in such languages
Summary
Few technologies hold as much promise for the social sciences and humanities as optical character recognition (OCR). Pre-trained, general OCR processors have a much higher potential for wide adoption in the scholarly community, and their out-of-the-box performance is of scientific interest. General OCR processors such as Tesseract ([27, 38]) have tended to deliver perfect results only under what we may call laboratory conditions, i.e., on noise-free, single-column text in a clear printed font. This limits their utility for real-life historical documents, which often contain shading, blur, shine-through, stains, skew, complex layouts, and other artifacts that produce OCR errors. General OCR processors have also struggled with non-Western languages ([16]), rendering them less useful for the many scholars working on documents in such languages.
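As a concrete illustration of the "laboratory conditions" point, the hedged sketch below runs Tesseract locally (via the pytesseract wrapper, which requires the Tesseract binary to be installed) on a clean page image and on an artificially blurred copy. The file path and blur radius are hypothetical placeholders, not the noise settings used in the study.

```python
from PIL import Image, ImageFilter
import pytesseract

# Hypothetical input: a clean, single-column page scan.
page = Image.open("page_scan.png")

# OCR the clean image.
clean_text = pytesseract.image_to_string(page, lang="eng")

# Simulate one common noise type (blur) and OCR the degraded copy.
blurred = page.filter(ImageFilter.GaussianBlur(radius=2))
noisy_text = pytesseract.image_to_string(blurred, lang="eng")

print(len(clean_text), len(noisy_text))  # output quality typically degrades with noise
```

Scoring both outputs against a ground-truth transcription (e.g., with a character error rate as sketched after the abstract) reproduces, in miniature, the noise-sensitivity comparison the study reports; the server-based processors (Textract and Document AI) are instead accessed through their respective cloud APIs.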