Abstract

Extracting textual information from scanned document pages is a fundamental stage of any digitisation effort, and its quality directly determines the success of downstream document analysis and understanding applications. To evaluate and improve the performance of optical character recognition (OCR), the accuracy of that step must be measured in isolation, without the influence of the processing steps that precede it (e.g. text block segmentation and ordering). Current OCR performance evaluation measures, which are based on edit distance, are strongly subjective because they must first serialise the entire text of a document. That serialisation is heavily influenced by the reading order determined by the processing steps prior to OCR, which is often wrong, especially for multicolumn and complex layouts. This paper presents a new objective and practical edit-distance-based character recognition accuracy measure that overcomes those limitations. It achieves independence from the reading order by comparing substrings of text in a flexible way (i.e. allowing for ordering variations). Because OCR errors are isolated and the other steps can be evaluated and optimised separately, the precision of the flexible character accuracy measure enables effective tuning of complete digitisation workflows. For the same reason, it also enables a better estimate of the post-OCR (manual) correction effort required. The proposed measure has been systematically analysed and validated under lab conditions and has been used successfully in practice in a number of high-profile international competitions since 2017.

Highlights

  • Document recognition systems, also known as page reading systems, play a crucial role in all digitisation efforts: they extract and describe the information on scanned physical documents for further analysis and understanding

  • The ordering of the page content blocks identified by the layout analysis steps that precede optical character recognition (OCR) is carried into the OCR input and into the subsequent serialisation of the recognised text. If that reading order is wrong, the character accuracy measure can be very low, even if the recognition of each individual character is perfect

  • In the worst of the examples examined, the traditional character accuracy measure drops to 25%, even though every ground truth word is present in the OCR result
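
The reading-order effect described in the highlights above can be illustrated with a minimal Python sketch. It is not the flexible measure proposed in the paper; it only shows how a traditional edit-distance-based character accuracy reacts when two perfectly recognised text blocks are serialised in the wrong order. The helper names `levenshtein` and `character_accuracy`, the normalisation (1 minus edit distance divided by ground truth length, clipped at zero), and the two sample blocks are all assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Assumed traditional measure: 1 - errors / len(ground truth), clipped at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = levenshtein(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))


# Hypothetical two-column page: every character is recognised correctly,
# only the order in which the blocks are serialised differs.
blocks = ["first column text", "second column text"]
ground_truth = "\n".join(blocks)
same_order = "\n".join(blocks)
swapped_order = "\n".join(reversed(blocks))

if __name__ == "__main__":
    print("correct reading order:", character_accuracy(ground_truth, same_order))
    print("swapped reading order:", character_accuracy(ground_truth, swapped_order))
```

Both serialisations contain exactly the same characters, yet only the one in ground truth order scores 1.0. The swapped one is penalised purely for block ordering, which is the dependence that a reading-order-independent measure is meant to remove.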

Introduction

Document recognition systems, also known as page reading systems, play a crucial role in all digitisation efforts: they extract and describe the information on scanned physical documents for further analysis and understanding. The accuracy of the information extracted at this fundamental stage directly determines the success of all subsequent analysis stages, which construct higher-level semantic representations of the information contained in the documents. Starting with scanned pages as input, document recognition systems perform multiple processing steps, including layout analysis (region and text line segmentation) and optical character recognition (OCR). Performance evaluation is used to assess and benchmark different systems or methods (e.g. to choose the best one for a certain document collection or use case) or, at a lower level, to adapt a specific method (improving the method, parameter tuning, or training).
