Abstract

Automatic transcription of large series of historical handwritten documents generally aims to enable searching for textual information in these documents. However, automatic transcripts often lack the accuracy needed for reliable text indexing and search. Probabilistic Indexing (PrIx) offers a unique alternative to raw transcripts. Since PrIx needs training data to achieve good search performance, this paper introduces PrIx-based crowdsourcing techniques to gather the required data. In the proposed approach, PrIx confidence measures drive a correction process in which users can amend errors and, where necessary, add missing text. In a further step, the corrected data are used to retrain the PrIx models. Results reported on five large series show consistent improvements after retraining. However, it remains debatable whether the overall cost of the crowdsourcing operation pays off for these improvements, or whether it would have been more cost-effective to simply start with a larger and cleaner set of professionally produced training transcripts.
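
As a rough illustration of how confidence-driven correction could be organized, the following Python fragment is a minimal sketch under assumed data structures (the Spot record, the 0.6 threshold, and the helper names are hypothetical, not the authors' implementation). It queues low-confidence PrIx word hypotheses for crowd review and turns each reviewer answer into a verified sample for retraining.

```python
# Minimal sketch (illustrative assumptions, not the paper's actual system):
# select low-confidence PrIx word hypotheses for crowd correction and
# collect the verified labels as retraining samples.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Spot:
    page_id: str          # page image identifier
    bbox: tuple           # (x, y, w, h) of the hypothesized word region
    word: str             # hypothesized transcript of the word
    confidence: float     # PrIx relevance/confidence score in [0, 1]

def correction_queue(spots: List[Spot], threshold: float = 0.6) -> List[Spot]:
    """Return hypotheses below the confidence threshold, least certain first,
    so reviewer effort goes where the index is most likely wrong."""
    doubtful = [s for s in spots if s.confidence < threshold]
    return sorted(doubtful, key=lambda s: s.confidence)

def apply_review(spot: Spot, corrected_word: Optional[str]) -> Optional[dict]:
    """Turn one crowd review into a training sample.
    corrected_word is None when the reviewer rejects a spurious hypothesis."""
    if corrected_word is None:
        return None                       # spurious spot: contributes nothing
    return {"page_id": spot.page_id,      # verified (image region, text) pair
            "bbox": spot.bbox,
            "text": corrected_word}
```

In this sketch, the non-None outputs of apply_review would accumulate into the corrected data set used to retrain the PrIx models in the further step described above.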
