Abstract

Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language’s morphology on language modeling.

Highlights

  • With most research in Natural Language Processing (NLP) directed at a small subset of the world’s languages, whether the techniques developed are truly language-agnostic is often not known

  • Our results show that Byte-Pair Encoding (BPE) language modeling surprisal is significantly correlated with measures of morphological typology and complexity

  • We report a strong association between several morphological features and surprisal per verse for BPE language models, compared with language models based on other segmentation methods

Summary

Introduction

With most research in Natural Language Processing (NLP) directed at a small subset of the world’s languages, it is often unknown whether the techniques developed are truly language-agnostic. Gerz et al. (2018) and Cotterell et al. (2018) find that morphological complexity is predictive of language modeling difficulty, while Mielke et al. (2019) conclude that simple statistics of a text, such as the number of types, explain differences in modeling difficulty, rather than morphological measures. This paper revisits the issue by increasing the number of languages considered and augmenting the kind and number of morphological features used. We investigate how language modeling surprisal is correlated with 12 linguist-generated morphological features and four corpus-based measures of morphological complexity.
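
As a rough illustration of the kind of analysis described here, the sketch below computes surprisal per verse from the probabilities a language model assigns to each subword token, then correlates the per-language averages with a morphological complexity score. The function names, the toy probabilities, the toy complexity scores, and the choice of Pearson correlation are all assumptions made for illustration; this summary does not specify the exact statistical procedure used in the paper.

```python
import math


def verse_surprisal(token_probs):
    """Total surprisal of one verse in bits: the sum of -log2(p) over its tokens."""
    return -sum(math.log2(p) for p in token_probs)


def mean_surprisal_per_verse(verses):
    """Average verse-level surprisal over a list of verses for one language."""
    return sum(verse_surprisal(v) for v in verses) / len(verses)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical toy data: per-token model probabilities for two verses per language.
toy_token_probs = {
    "lang_a": [[0.5, 0.25], [0.5, 0.5]],
    "lang_b": [[0.25, 0.25], [0.5, 0.125]],
    "lang_c": [[0.125, 0.25], [0.125, 0.125]],
}
# Hypothetical morphological complexity scores (not values from the paper).
toy_complexity = {"lang_a": 1.0, "lang_b": 2.0, "lang_c": 3.5}

languages = sorted(toy_token_probs)
surprisals = [mean_surprisal_per_verse(toy_token_probs[lang]) for lang in languages]
complexities = [toy_complexity[lang] for lang in languages]
r = pearson(complexities, surprisals)
```

Because the verses in a parallel Bible corpus carry the same content across languages, summing surprisal over a whole verse rather than averaging per token keeps the comparison independent of how finely each segmentation method splits the text.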
