Abstract

Archives around the world hold vast uncatalogued series of image bundles of digitized historical manuscripts which contain, among other documents, notarial records, also known as “deeds” or “acts”. One of the first steps towards providing metadata that describe the contents of these bundles is to segment each bundle into its individual deeds. Even when deeds are page-aligned, as in the bundles considered in the present work, manual segmentation is time-consuming and often prohibitive given the huge scale of the manuscript series involved. Unlike traditional Layout Analysis methods for page-level segmentation, our approach goes beyond the single-page image and provides consistent deed detection results over full bundles. This is achieved in two tightly integrated steps: first, the probabilities that each bundle image is an “initial”, “middle” or “final” page of a deed are estimated; then, an optimal sequence of page labels is computed at the whole-bundle level. Empirical results show that this approach achieves almost perfect segmentation of bundles from a massive Spanish series of historical notarial records.
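
The second step, computing a consistent label sequence from per-page probabilities, can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: the abstract does not specify the decoder, so the code below assumes a Viterbi-style search, assumes every deed spans at least two pages (one "initial" page, zero or more "middle" pages, one "final" page), and assumes the bundle starts and ends on deed boundaries. The names `decode_bundle`, `split_deeds`, `ALLOWED` and the example probabilities are hypothetical.

```python
import math

LABELS = ("I", "M", "F")  # initial, middle, final page of a deed

# Assumed consistency constraints (not taken from the paper):
# a deed is I, then zero or more M, then F; a new deed may only start after F.
ALLOWED = {
    "I": ("M", "F"),
    "M": ("M", "F"),
    "F": ("I",),
}


def _logp(p):
    """Log-probability, mapping zero probability to -inf."""
    return math.log(p) if p > 0 else -math.inf


def decode_bundle(page_probs):
    """Return the most probable consistent label sequence for a bundle.

    page_probs: one dict per page, mapping each label in LABELS to the
    probability estimated for that page in the first step.
    """
    n = len(page_probs)
    if n < 2:
        raise ValueError("this sketch assumes deeds of at least two pages")

    # score[t][lab]: best log-probability of a consistent prefix ending in lab.
    score = [{lab: -math.inf for lab in LABELS} for _ in range(n)]
    back = [{lab: None for lab in LABELS} for _ in range(n)]
    score[0]["I"] = _logp(page_probs[0]["I"])  # bundle must open a deed

    for t in range(1, n):
        for prev in LABELS:
            if score[t - 1][prev] == -math.inf:
                continue
            for cur in ALLOWED[prev]:
                cand = score[t - 1][prev] + _logp(page_probs[t][cur])
                if cand > score[t][cur]:
                    score[t][cur] = cand
                    back[t][cur] = prev

    if score[n - 1]["F"] == -math.inf:  # bundle must close a deed
        raise ValueError("no consistent labelling exists for these inputs")

    # Trace the best path backwards from the final page.
    labels = ["F"]
    for t in range(n - 1, 0, -1):
        labels.append(back[t][labels[-1]])
    labels.reverse()
    return labels


def split_deeds(labels):
    """Turn a consistent label sequence into (first_page, last_page) pairs."""
    deeds, start = [], 0
    for i, lab in enumerate(labels):
        if lab == "I":
            start = i
        elif lab == "F":
            deeds.append((start, i))
    return deeds


# Made-up probabilities for a five-page bundle.
example = [
    {"I": 0.90, "M": 0.05, "F": 0.05},
    {"I": 0.10, "M": 0.20, "F": 0.70},
    {"I": 0.40, "M": 0.50, "F": 0.10},  # page-wise argmax would say "M"
    {"I": 0.10, "M": 0.80, "F": 0.10},
    {"I": 0.05, "M": 0.05, "F": 0.90},
]
labels = decode_bundle(example)
deeds = split_deeds(labels)
```

In this example the third page would be labelled "M" if each page were classified independently, which would leave a "middle" page directly after a "final" one. Decoding over the whole bundle instead labels it "I", which shows why a bundle-level search can give consistent results where page-level decisions do not.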
