Abstract

Hyphenated words are very frequent in historical manuscripts. Reliable recognition of these words, or rather of their prefix and suffix fragments, is problematic and has not been sufficiently studied so far. If the aim is to transcribe text images, a sufficiently accurate character-level recognition of the fragments might be an admissible transcription result. However, if the goal is to allow searching for words, or “keyword spotting”, this is not acceptable at all, because users need to query entire words rather than possible fragments of these words. The situation becomes even worse if the aim is to index images for lexicon-free searching for arbitrary text. To start with, this makes it necessary to know whether the concatenation of two word fragments constitutes a regular word or each fragment is instead a word by itself. We propose a probabilistic model to deal with these complications and present a first development of this model, based only on lexicon-free probabilistic indexing of the text images. Albeit preliminary, it already makes it possible to find both entire and hyphenated forms of arbitrary query words very accurately, using just the entire forms of the words as queries. Experiments carried out on a representative part of a huge historical collection of the National Archives of Finland confirm the usefulness of the proposed methods.

