Abstract

Historical text normalization often relies on small training datasets. Recent work has shown that multi-task learning can lead to significant improvements by exploiting synergies with related datasets, but there has been no systematic study of different multi-task learning architectures. This paper evaluates 63 multi-task learning configurations for sequence-to-sequence-based historical text normalization across ten datasets from eight languages, using autoencoding, grapheme-to-phoneme mapping, and lemmatization as auxiliary tasks. We observe consistent, significant improvements across languages when training data for the target task is limited, but minimal or no improvements when training data is abundant. We also show that zero-shot learning outperforms the simple, but relatively strong, identity baseline.

Highlights

  • Historical text normalization is the task of mapping variant spellings in historical documents (e.g., digitized medieval manuscripts) to a common form, typically their modern equivalent

  • Our main focus is on analyzing the usefulness of multi-task learning strategies (a) to leverage whatever supervision is available for the language in question, or (b) to do away with the need for supervision in the target language altogether

  • We show that in few-shot learning scenarios, multi-task learning leads to robust, significant gains over a state-of-the-art, single-task baseline. We are, to the best of our knowledge, the first to consider zero-shot historical text normalization, and we show significant improvements over the simple, but relatively strong, identity baseline (see the sketch below)
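
To make the last highlight concrete, here is a minimal sketch of what the identity baseline computes: it predicts that every historical spelling is already its normalized form and scores word-level accuracy against the gold normalizations. The function name and the data format are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch of the identity baseline: predict that the normalized form is
# identical to the historical spelling, then measure word-level accuracy.
def identity_baseline_accuracy(pairs):
    """pairs: iterable of (historical_form, gold_normalization) string tuples."""
    pairs = list(pairs)
    correct = sum(1 for hist, gold in pairs if hist == gold)
    return correct / len(pairs) if pairs else 0.0

# Example: two of three historical tokens already match their modern form.
print(identity_baseline_accuracy([("thou", "thou"), ("vppon", "upon"), ("hath", "hath")]))
```

The baseline is "relatively strong" because many historical tokens are already spelled like their modern equivalents, so any normalization model has to beat this copy-through accuracy to be useful.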

Summary

Introduction

Historical text normalization is the task of mapping variant spellings in historical documents (e.g., digitized medieval manuscripts) to a common form, typically their modern equivalent. Many historical documents were written in the absence of standard spelling conventions, and annotated datasets are rare and small, making automatic normalization a challenging task (cf. Piotrowski, 2012; Bollmann, 2018). We experiment with datasets in eight different languages: English, German, Hungarian, Icelandic, Portuguese, Slovene, Spanish, and Swedish. Bollmann et al. (2017) previously showed that multi-task learning with grapheme-to-phoneme conversion as an auxiliary task improves a sequence-to-sequence model for historical text normalization of German texts; Bollmann et al. (2018) showed that multi-task learning is helpful in low-resource scenarios. We evaluate 63 multi-task learning configurations across ten datasets in eight languages, and with three different auxiliary tasks. While our focus is on the specific task of historical text normalization, we believe that our results can be of interest to anyone looking to apply multi-task learning in low-resource scenarios.

Datasets

We consider ten datasets spanning eight languages, taken from Bollmann (2019). Table 1 gives an overview of the languages and the size of the development set, which we use for evaluation.
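
The multi-task setup mentioned above pairs character-level normalization with auxiliary tasks such as autoencoding, grapheme-to-phoneme mapping, and lemmatization. The following is a minimal PyTorch sketch of one such configuration, a shared character encoder with one decoder head per task. The module names, the mean-pooled context (standing in for the attention mechanism used in this line of work), and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: a character-level encoder-decoder where the encoder is shared
# across tasks and each task gets its own decoder head.  All names and sizes
# are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, char_ids):
        # char_ids: (batch, src_len) -> encoder states: (batch, src_len, 2*hidden_dim)
        outputs, _ = self.rnn(self.embed(char_ids))
        return outputs

class TaskDecoder(nn.Module):
    """One decoder per task; only the encoder parameters are shared."""
    def __init__(self, vocab_size, enc_dim=256, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim + enc_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, enc_outputs, tgt_ids):
        # Simplification: mean-pool the encoder states and feed the summary to
        # every decoder step (real systems typically use attention instead).
        context = enc_outputs.mean(dim=1, keepdim=True).expand(-1, tgt_ids.size(1), -1)
        hidden, _ = self.rnn(torch.cat([self.embed(tgt_ids), context], dim=-1))
        return self.out(hidden)  # (batch, tgt_len, vocab_size)

# Multi-task setup: one shared encoder, one decoder head per task.
vocab_size = 60
encoder = SharedEncoder(vocab_size)
decoders = nn.ModuleDict({
    "normalization": TaskDecoder(vocab_size),
    "autoencoding": TaskDecoder(vocab_size),
    "g2p": TaskDecoder(vocab_size),
    "lemmatization": TaskDecoder(vocab_size),
})
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoders.parameters()))

def training_step(task, src_ids, tgt_in_ids, tgt_out_ids):
    """One update on a batch from a single task; batches from different tasks
    are typically interleaved so the shared encoder sees all of them."""
    logits = decoders[task](encoder(src_ids), tgt_in_ids)
    loss = loss_fn(logits.reshape(-1, vocab_size), tgt_out_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The intended benefit of sharing the encoder is that supervision from the auxiliary tasks shapes the character representations even when normalization training data is scarce, which matches the paper's finding that gains appear mainly in low-resource settings.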
