Abstract

Handwritten text recognition (HTR) and word retrieval, also known as word spotting, are traditional problems in the document analysis community. While the use of increasingly large neural network architectures has led to a steady improvement in performance, it comes with the drawback of requiring manually annotated training data. This poses a major obstacle when applying such models to new document collections. To overcome this drawback, we propose a self-training approach that makes it possible to train state-of-the-art models for HTR and word spotting. Self-training is a common technique in semi-supervised learning; it usually relies on a small labeled dataset and on training with pseudo-labels generated by an initial model. In this work, we show that it is feasible to train models on synthetic data that perform well enough to serve as initial models for self-training. Therefore, the proposed training method does not rely on any manually annotated samples. We further investigate visual and language properties of the synthetic datasets. To improve the performance and robustness of the self-training approach, we propose different confidence measures for both models that make it possible to identify and remove erroneous pseudo-labels. The presented training approach clearly outperforms other learning-free methods and adaptation strategies in the absence of manually annotated data.
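As an illustration of the general idea, confidence-filtered self-training can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the names `filter_pseudo_labels`, `self_train`, and `train_fn`, the fixed `threshold`, and the convention that a model returns a `(label, confidence)` pair are all hypothetical. The specific confidence measures and the way synthetic data is used to train the initial model are not specified here.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a "model" is any callable mapping a sample to (label, confidence).
Prediction = Tuple[str, float]
Model = Callable[[object], Prediction]
Labeled = List[Tuple[object, str]]


def filter_pseudo_labels(samples: Sequence[object], model: Model,
                         threshold: float) -> Labeled:
    """Label samples with the model and keep only confident predictions."""
    kept: Labeled = []
    for sample in samples:
        label, confidence = model(sample)
        # Low-confidence predictions are treated as likely erroneous and dropped.
        if confidence >= threshold:
            kept.append((sample, label))
    return kept


def self_train(initial_model: Model, unlabeled: Sequence[object],
               train_fn: Callable[[Labeled], Model],
               threshold: float = 0.9, rounds: int = 3) -> Model:
    """Iteratively retrain on the model's own confident pseudo-labels.

    `initial_model` stands in for a model trained on synthetic data;
    `train_fn` is a hypothetical routine that fits a new model on
    (sample, pseudo-label) pairs.
    """
    model = initial_model
    for _ in range(rounds):
        pseudo = filter_pseudo_labels(unlabeled, model, threshold)
        if not pseudo:
            break  # nothing confident enough to learn from
        model = train_fn(pseudo)
    return model
```

In this sketch, no manually annotated sample enters the loop: the only supervision comes from the initial model and from the pseudo-labels that pass the confidence filter.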
