Abstract

When optical character recognition (OCR) devices process the same page image, they generate similar text strings. Differences are due to recognition errors. A page of text rarely contains long repeated substrings; therefore, N strings generated by OCR devices can be quickly matched by detecting long common substrings. An algorithm for matching an arbitrary number of strings based on this principle is presented. Although its worst-case performance is O(Nn²), its performance in practice has been observed to be O(Nn log n), where n is the length of a string. This algorithm has been successfully used to study OCR errors, to determine the accuracy of OCR devices, and to implement a voting algorithm.
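
The principle of matching by long common substrings can be illustrated with a minimal sketch. It is not the algorithm presented in the paper: it handles only two strings rather than N, it relies on Python's standard `difflib.SequenceMatcher.find_longest_match` to locate each common substring, and the function name `match_by_common_substrings` and the `min_len` threshold are assumptions introduced purely for illustration. The sketch finds the longest common substring, records it as a match, and then repeats the search on the unmatched text to its left and right.

```python
from difflib import SequenceMatcher


def match_by_common_substrings(a, b, min_len=3):
    """Return sorted (i, j, length) blocks with a[i:i+length] == b[j:j+length].

    Hypothetical helper: anchors the two strings on their longest common
    substring, then repeats on the regions to the left and right of it.
    Common substrings shorter than min_len are ignored as likely coincidences.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    blocks = []
    # Explicit stack instead of recursion, so long pages do not hit
    # Python's recursion limit.
    stack = [(0, len(a), 0, len(b))]
    while stack:
        alo, ahi, blo, bhi = stack.pop()
        m = matcher.find_longest_match(alo, ahi, blo, bhi)
        if m.size < min_len:
            continue
        blocks.append((m.a, m.b, m.size))
        stack.append((alo, m.a, blo, m.b))                    # left of match
        stack.append((m.a + m.size, ahi, m.b + m.size, bhi))  # right of match
    return sorted(blocks)


if __name__ == "__main__":
    # Two made-up OCR outputs of the same line, differing in two characters.
    first = "The quick brown fox"
    second = "The qu1ck brown f0x"
    for i, j, n in match_by_common_substrings(first, second):
        print(i, j, repr(first[i:i + n]))
```

The text left between consecutive matched blocks is where the two outputs disagree, which is the raw material for counting errors or for voting among devices. Extending this idea to an arbitrary number of strings, and achieving the running times quoted above, is the subject of the paper itself and is not attempted here.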
