Abstract
Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n = 322) and Arabic-language article scans (n = 100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) performed substantially better than Tesseract, especially on noisy documents. Accuracy for English was considerably higher than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available “Noisy OCR Dataset” (NOD) for reuse in future benchmarking studies.
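For readers who want to reproduce this kind of comparison, OCR accuracy is commonly scored by aligning the recognized text against a ground-truth transcription, for instance with a character error rate (CER). The sketch below is a minimal, self-contained Python illustration of that metric; it is not the study's own evaluation code, and the function and example strings are placeholders of our own.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein edit distance between the ground truth and the OCR output,
    normalized by the reference length (0.0 = perfect recognition)."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

# Example: a noisy OCR output of a short ground-truth string.
truth = "Optical Character Recognition"
ocr_output = "0ptical Charaoter Recognltion"
print(f"CER: {character_error_rate(truth, ocr_output):.3f}")  # ~0.103
```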
Highlights
Few technologies hold as much promise for the social sciences and humanities as optical character recognition (OCR)
Pre-trained, general Optical Character Recognition (OCR) processors have a much higher potential for wide adoption in the scholarly community, and their out-of-the-box performance is of scientific interest
General OCR processors have struggled with non-Western languages ([16]), rendering them less useful for the many scholars working on documents in such languages
Summary
Few technologies hold as much promise for the social sciences and humanities as optical character recognition (OCR). Pre-trained, general OCR processors have a much higher potential for wide adoption in the scholarly community, and their out-of-the-box performance is of scientific interest. General OCR processors such as Tesseract ([27, 38]) have tended to deliver perfect results only under what we may call laboratory conditions, i.e., on noise-free, single-column text in a clear printed font. This limits their utility for real-life historical documents, which often contain shading, blur, shine-through, stains, skew, complex layouts, and other artifacts that produce OCR errors. General OCR processors have also struggled with non-Western languages ([16]), rendering them less useful for the many scholars working on documents in such languages.
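As a concrete illustration of the "laboratory conditions" point, the hedged sketch below runs Tesseract locally (via the pytesseract wrapper, which requires the Tesseract binary to be installed) on a clean page image and on an artificially blurred copy. The file path and blur radius are hypothetical placeholders, not the noise settings used in the study.

```python
from PIL import Image, ImageFilter
import pytesseract

# Hypothetical input: a clean, single-column page scan.
page = Image.open("page_scan.png")

# OCR the clean image.
clean_text = pytesseract.image_to_string(page, lang="eng")

# Simulate one common noise type (blur) and OCR the degraded copy.
blurred = page.filter(ImageFilter.GaussianBlur(radius=2))
noisy_text = pytesseract.image_to_string(blurred, lang="eng")

print(len(clean_text), len(noisy_text))  # output quality typically degrades with noise
```

Scoring both outputs against a ground-truth transcription (e.g., with a character error rate as sketched after the abstract) reproduces, in miniature, the noise-sensitivity comparison the study reports; the server-based processors (Textract and Document AI) are instead accessed through their respective cloud APIs.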