Abstract

The main aim of the Carabela project was to develop and apply techniques that enable textual search in massive Spanish collections of 15th-19th century manuscripts. The project focused on a relatively small subset of 125 000 images from collections of interest to underwater archaeology. For this type of manuscript, state-of-the-art automatic transcription techniques generally fail to achieve usable transcription accuracy. Therefore, rather than insisting on actual transcription, the project adopted methodologies for probabilistic indexing of handwritten text images. This has allowed us to cope effectively with the intrinsically high degree of uncertainty of the text contained in most historical manuscripts, leading to highly effective systems for textual search and retrieval. Carabela has gone one step further by developing new techniques to classify probabilistically indexed, but otherwise untranscribed, text images according to their textual content. These techniques have been successfully used to automatically classify Carabela bundles (each containing hundreds or thousands of pages) according to their “level of risk” of public exposure, in order to control access to them and to prevent, as far as possible, the plundering of Spanish underwater heritage.
