Abstract

For noisy, historical documents, a high optical character recognition (OCR) word error rate (WER) can render the OCR text unusable. Since image binarization is often the method used to identify foreground pixels, a body of research seeks to improve image-wide binarization directly. Instead of relying on any one imperfect binarization technique, our method incorporates information from multiple simple thresholding binarizations of the same image to improve text output. Using a new corpus of 19th century newspaper grayscale images for which the text transcription is known, we observe WERs of 13.8% and higher using current binarization techniques and a state-of-the-art OCR engine. Our novel approach combines the OCR outputs from multiple thresholded images by aligning the text output and producing a lattice of word alternatives from which a lattice word error rate (LWER) is calculated. Our results show a LWER of 7.6% when aligning two threshold images and a LWER of 6.8% when aligning five. From the word lattice we commit to one hypothesis by applying the methods of Lund et al. (2011) achieving an improvement over the original OCR output and a 8.41% WER result on this data set.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.