Abstract

Content-based classification of manuscripts is an important task that is generally carried out by expert archivists. However, many historical manuscript collections are so vast that manual classification is hardly feasible, even for large, well-staffed archives. Nowadays, manuscripts are generally preserved as sets of digital images. The technical problem we address is therefore the automatic classification of “image documents”, each consisting of a set of untranscribed handwritten text images, according to the textual content of those images. The traditional Pattern Recognition classification paradigm provides the basic tools to deal with this problem. In practice, however, the set of relevant classes of a large document series is seldom known in advance, so a classifier trained with a predefined set of classes will systematically fail when new image documents arrive that do not belong to any of the classes assumed in training. Here we adopt the “Open Set Classification” framework to extend and consolidate our previous work on image document classification so that new documents from unknown classes are handled adequately. The proposed approaches are based on a relatively novel technology for text image representation known as “probabilistic indexing”, which proves very effective at characterising the intrinsic word-level uncertainty exhibited by historical handwritten text images. We assess the performance of these approaches on a moderately sized but representative dataset extracted from a huge series of complex notarial manuscripts from the Spanish Archivo Histórico Provincial de Cádiz, with good results.
