Abstract
The Proceedings of the Old Bailey is a corpus of over 180,000 page images of court records printed from April 1674 to April 1913 and presents a comprehensive challenge for Optical Character Recognition (OCR) services. The Old Bailey is an ideal benchmark for historical document OCR, representing more than two centuries of variations in documents, including spellings, formats, and printing and preservation qualities. In addition to its historical and sociological significance, the Old Bailey is filled with imperfections that reflect the reality of coping with large-scale historical data. Most importantly, the Old Bailey contains human transcriptions for each page, which can be used to help measure OCR accuracy. Since humans do make mistakes in transcriptions, the relative performance of OCR services will be more informative than their absolute performance. This paper compares three leading commercial OCR cloud services: Amazon Web Services's Textract (AWS); Microsoft Azure's Cognitive Services (Azure); and Google Cloud Platform's Vision (GCP). Benchmarking involved downloading over 180,000 images, executing the OCR, and measuring the error rate of the OCR text against the human transcriptions. Our results found that AWS had the lowest median error rate, Azure had the lowest median round trip time, and GCP had the best combination of a low error rate and a low duration.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.