Abstract

The choice of a commercial optical character recognition (OCR) engine matters when technical drawings are indexed automatically from their title blocks. We aim to benchmark commercial OCR engines with respect to their place in the overall digitalisation chain, from scanning to understanding the textual information contained in a technical drawing. The crucial (costly) step is the manual correction of OCR errors. By benchmarking, we intend to identify, for our application domain, the causes of the OCR errors that are the most costly to correct. For a given OCR engine, we model the correction cost as a function of image characteristics. Our methodology therefore relies on two elements: on the one hand, the design of the correction cost, which represents how difficult a correction is for a human operator; on the other hand, the classification of the image characteristics that may lead to OCR errors. We analyse the behaviour of this correction cost by principal component analysis (PCA), comparing the engines pairwise to reveal their complementarity. This methodology yields a list of domain-dependent problems for OCR engines, ranked by importance with respect to the correction cost. The list can then be used to choose an OCR engine, or to improve OCR execution by focusing on the most important problems. Although we are confident that the methodology could easily be applied to other document classes, we apply it here to the domain of technical drawings, and find that the OCR engines we tested are not well suited to our problem.
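
As an illustration only, the two elements of the methodology could be sketched in Python as follows. The abstract does not define the correction cost, so a weighted edit distance is used here as a hypothetical stand-in; the function names, the two image characteristics (character height and skew) and every number in the example are invented for the sketch and are not taken from the paper.

```python
import numpy as np


def correction_cost(ocr_text, truth, sub=1.0, ins=1.0, dele=1.0):
    """Weighted edit distance from the OCR output to the ground truth.

    Hypothetical stand-in for a correction cost: each weight is meant to
    reflect the operator effort for one kind of manual fix.
    """
    m, n = len(ocr_text), len(truth)
    d = np.zeros((m + 1, n + 1))
    d[:, 0] = np.arange(m + 1) * dele   # delete every OCR character
    d[0, :] = np.arange(n + 1) * ins    # insert every true character
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = ocr_text[i - 1] == truth[j - 1]
            d[i, j] = min(
                d[i - 1, j] + dele,
                d[i, j - 1] + ins,
                d[i - 1, j - 1] + (0.0 if same else sub),
            )
    return float(d[m, n])


def pca(X):
    """PCA via SVD of the standardised data.

    Returns the explained-variance ratio of each component and the
    loadings (one component per row).
    """
    centred = X - X.mean(axis=0)
    std = centred.std(axis=0)
    std[std == 0] = 1.0                 # guard against constant columns
    _, s, vt = np.linalg.svd(centred / std, full_matrices=False)
    var = s ** 2
    return var / var.sum(), vt


# Invented data. Columns: character height (px), skew (degrees),
# correction cost for engine A, correction cost for engine B.
X = np.array([
    [12.0, 0.0, 1.0, 0.0],
    [10.0, 1.5, 2.0, 3.0],
    [8.0, 0.5, 4.0, 2.0],
    [6.0, 3.0, 7.0, 6.0],
    [14.0, 0.2, 0.0, 1.0],
    [7.0, 2.0, 5.0, 8.0],
])
ratio, loadings = pca(X)
print("explained variance ratio:", np.round(ratio, 3))
print("first component loadings:", np.round(loadings[0], 3))
```

In such a sketch, image characteristics that load on the same component as an engine's cost column would be candidates for the costly error causes, and comparing the loadings of the two cost columns gives one way to look at how a pair of engines complement each other.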
