Abstract

We present a novel approach to improve the output of optical character recognition (OCR) systems by first detecting duplicate passages in their output and then performing consensus decoding combined with a language model. This approach is orthogonal to, and may be combined with, previously proposed methods for combining the output of different OCR systems on the same image or the output of the same OCR system on differently processed images of the same text. It may also be combined with methods to estimate the parameters of a noisy channel model of OCR errors. Additionally, the current method generalizes previous proposals for a simple majority-vote combination of known duplicated texts. On a corpus of historical newspapers, an annotated set of clusters has a baseline word error rate (WER) of 33%. A majority-vote procedure reaches 23% on passages where one or more duplicates were found, and consensus decoding combined with a language model achieves 18% WER. In a separate experiment, newspapers were aligned to very widely reprinted texts such as State of the Union speeches, producing clusters with up to 58 witnesses. Beyond 20 witnesses, majority vote outperforms language model rescoring, though the gap between them is much smaller in this experiment.
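To make the two quantities named above concrete, the following is a minimal sketch of a character-level majority vote and of word error rate. It is not the authors' implementation: it assumes the witnesses have already been aligned to equal length, and the gap symbol `GAP` and the function names `majority_vote` and `word_error_rate` are hypothetical. The duplicate detection, alignment, and language model rescoring steps are not shown.

```python
from collections import Counter

GAP = "-"  # hypothetical gap symbol marking a missing character in a witness


def majority_vote(aligned_witnesses):
    """Return the most frequent character in each column of pre-aligned witnesses.

    Assumes every witness string has the same length. Columns whose winning
    symbol is GAP contribute nothing to the consensus.
    """
    consensus = []
    for column in zip(*aligned_witnesses):
        best, _count = Counter(column).most_common(1)[0]
        if best != GAP:
            consensus.append(best)
    return "".join(consensus)


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Three noisy witnesses of the same passage, each with a different error.
witnesses = ["the qnick brown", "tbe quick brown", "the quick browm"]
consensus = majority_vote(witnesses)
```

Each witness in this toy example contains a different single-character error, so the column-wise vote recovers the clean text, while any single witness scored against it has one wrong word out of three.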
