Abstract

This paper addresses the offline handwritten text recognition (HTR) problem with small training data sets. Techniques such as transfer learning and data augmentation have recently been applied to this problem, improving recognition performance. In these scenarios, we found that labelling errors in the training samples, present in some databases, have a great impact on the character error rate (CER). Accordingly, we propose a novel cross-validation technique to remove incorrectly labelled lines. In this approach, after a first training stage, transcribed lines with a CER above a threshold are discarded, where the threshold is a function of the amount of available data: with fewer lines, the CER tends to be larger even for correctly labelled lines, which suggests using higher thresholds. This technique and the validation of the threshold are analysed on the ICFHR 2018 competition on automated HTR and on other well-known databases such as Washington and Parzival. For the Ricordi database in ICFHR 2018, which contains transcription errors, we report a CER reduction of 2%.
