Abstract

Whole-book recognition is a document image analysis strategy that operates on the complete set of a book's page images, using automatic adaptation to improve accuracy. Our algorithm expects to be given approximate iconic and linguistic models, derived from (generally errorful) OCR results and (generally incomplete) dictionaries, and then, guided entirely by evidence internal to the test set, corrects the models, yielding improved accuracy. The iconic model describes image formation and determines the behavior of a character-image classifier. The linguistic model describes word-occurrence probabilities. In previous work, we reported that adapting the iconic model alone (with a perfect linguistic model) automatically reduced the word error rate on a 180-page book by a large factor. In this paper, we propose an algorithm that adapts the iconic model and the linguistic model alternately, improving both models on the fly. The linguistic model adaptation method, which we report here, identifies new words and adds them to the dictionary. With 64.6% of words missing from the initial dictionary, our previous algorithm reduced the word error rate from 40.2% to 23.2%. The new algorithm drives the word error rate down further, from 23.2% to 16.0%.
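To make the alternating adaptation concrete, the following is a minimal, purely illustrative Python sketch, not the authors' implementation: it alternates between growing the dictionary with consistently, confidently recognized out-of-dictionary words (linguistic adaptation) and a stand-in for re-estimating the classifier from dictionary-validated words (iconic adaptation). All names (Token, adapt_iconic_model, adapt_linguistic_model) and thresholds are hypothetical placeholders.

```python
# Illustrative sketch only (not the paper's algorithm): a toy alternating-adaptation
# loop over OCR'd word tokens, each carrying a recognition confidence.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Token:
    text: str          # current best transcription of one word image
    confidence: float  # classifier confidence in [0, 1]

def adapt_linguistic_model(tokens, dictionary, min_count=3, min_conf=0.9):
    """Add consistently and confidently recognized out-of-dictionary words."""
    candidates = Counter(
        t.text for t in tokens
        if t.text not in dictionary and t.confidence >= min_conf
    )
    new_words = {w for w, n in candidates.items() if n >= min_count}
    return dictionary | new_words

def adapt_iconic_model(tokens, dictionary):
    """Stand-in for re-estimating the character classifier from words the
    updated dictionary now validates; here we merely boost their confidence."""
    return [
        Token(t.text, min(1.0, t.confidence + 0.05)) if t.text in dictionary else t
        for t in tokens
    ]

def alternate_adaptation(tokens, dictionary, rounds=5):
    for _ in range(rounds):
        dictionary = adapt_linguistic_model(tokens, dictionary)  # grow lexicon
        tokens = adapt_iconic_model(tokens, dictionary)          # refit classifier
    return tokens, dictionary

if __name__ == "__main__":
    toy_tokens = [Token("whole", 0.95)] * 4 + [Token("recgnition", 0.4)]
    toy_dict = {"book", "recognition"}
    _, grown = alternate_adaptation(toy_tokens, toy_dict)
    print(sorted(grown))  # 'whole' has been added to the toy dictionary
```

In this toy loop the two adaptation steps feed each other, as in the paper's strategy: a larger dictionary validates more words, which in turn supplies more evidence for refitting the classifier.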
