Morphology Matters: A Multilingual Language Modeling Analysis

Hyunji Hayley Park,Lane Schwartz,Kenneth Steimel,Coleman Haley,Han Liu,Katherine J Zhang

doi:10.1162/tacl_a_00365

Abstract

Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language’s morphology on language modeling.

Highlights

With most research in Natural Language Processing (NLP) directed at a small subset of the world’s languages, whether the techniques developed are truly language-agnostic is often not known
Our results show that Byte-Pair Encoding (BPE) language modeling surprisal is significantly correlated with measures of morphological typology and complexity
We report a strong association between several morphological features and surprisal per verse for BPE language models, compared to language models based on other segmentation methods
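
For readers unfamiliar with BPE, the following is a minimal sketch of how merge rules can be learned and applied. It is a toy illustration of the general technique, not the segmentation pipeline used in the paper; the function names `learn_bpe` and `segment` and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties broken by the pair itself (an assumption,
        # chosen only to make the sketch deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(_apply_merge(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def _apply_merge(symbols, pair):
    """Fuse each left-to-right occurrence of `pair` in `symbols`."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def segment(word, merges):
    """Segment a word by applying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = _apply_merge(symbols, pair)
    return symbols


merges = learn_bpe(["low", "low", "lower"], 2)
pieces = segment("lowest", merges)
```

Because the merges are driven purely by pair frequency, the resulting subwords need not line up with morpheme boundaries, which is the contrast with Morfessor and FST segmentation drawn in the abstract.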

Summary

Introduction

With most research in Natural Language Processing (NLP) directed at a small subset of the world’s languages, whether the techniques developed are truly language-agnostic is often not known. Gerz et al. (2018) and Cotterell et al. (2018) find that morphological complexity is predictive of language modeling difficulty, while Mielke et al. (2019) conclude that simple statistics of a text, like the number of types, explain differences in modeling difficulty, rather than morphological measures. This paper revisits this issue by increasing the number of languages considered and augmenting the kind and number of morphological features used. We investigate how language modeling difficulty, measured as surprisal per verse, is correlated with 12 linguist-generated morphological features and four corpus-based measures of morphological complexity.
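
As a rough illustration of the difficulty measure, the sketch below computes surprisal per verse from per-token model probabilities. It assumes that a verse's surprisal is the sum of negative log probabilities (in bits) over all of its tokens, averaged over verses; the function names and the input layout are hypothetical and are not taken from the paper.

```python
import math


def verse_surprisal(token_probs):
    """Total surprisal of one verse in bits: the sum of -log2 p over its tokens."""
    return sum(-math.log2(p) for p in token_probs)


def surprisal_per_verse(verses):
    """Average total surprisal across verses.

    `verses` is a list of verses, each a list of the probabilities the
    language model assigned to that verse's tokens (hypothetical layout).
    """
    return sum(verse_surprisal(v) for v in verses) / len(verses)


# Toy probabilities, invented for illustration only.
example = surprisal_per_verse([[0.5, 0.25], [0.5]])
```

Summing over the whole verse, rather than averaging per token, keeps the quantity comparable when the same verse is split into different numbers of tokens by different segmentation methods.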



Journal: Transactions of the Association for Computational Linguistics
Publication Date: Mar 17, 2021
Citations: 11
License type: CC BY 4.0
