Abstract
Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language’s morphology on language modeling.
Highlights
With most research in Natural Language Processing (NLP) directed at a small subset of the world’s languages, whether the techniques developed are truly language-agnostic is often not known
Our results show that Byte-Pair Encoding (BPE) language modeling surprisal is significantly correlated with measures of morphological typology and complexity
We report a strong association between several morphological features and surprisal per verse for BPE language models, in contrast to language models based on other segmentation methods
Summary
With most research in Natural Language Processing (NLP) directed at a small subset of the world’s languages, whether the techniques developed are truly language-agnostic is often not known. Gerz et al. (2018) and Cotterell et al. (2018) find that morphological complexity is predictive of language modeling difficulty, while Mielke et al. (2019) conclude that simple statistics of a text, like the number of types, explain differences in modeling difficulty, rather than morphological measures. This paper revisits this issue by increasing the number of languages considered and augmenting the kind and number of morphological features used. We investigate how surprisal per verse is correlated with 12 linguist-generated morphological features and four corpus-based measures of morphological complexity.
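To make the quantities concrete, here is a minimal sketch of how surprisal per verse and its association with a morphological measure might be computed. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the per-token probabilities are assumed to come from some trained language model, and a Spearman rank correlation is used only as one plausible association measure, since this summary does not say which statistic the authors apply.

```python
import math

def verse_surprisal(token_probs):
    """Total surprisal of one verse in bits: the sum of -log2 p over its tokens.

    token_probs is assumed to hold the model probability of each subword token
    in the verse, whatever the segmentation (BPE, Morfessor, FST, ...).
    """
    return -sum(math.log2(p) for p in token_probs)

def mean_surprisal_per_verse(verses):
    """Average verse-level surprisal over a list of verses."""
    return sum(verse_surprisal(v) for v in verses) / len(verses)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Summing surprisal over a whole verse, rather than averaging per token, keeps the figure comparable across segmentation methods and across translations, because every language model is scored on the same unit of content regardless of how many subword tokens it splits that content into. With one mean surprisal value and one morphological score per language, `spearman(scores, surprisals)` would then give a single association figure for that feature.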
More From: Transactions of the Association for Computational Linguistics