Abstract

The success of pretrained transformer language models (LMs) in natural language processing has led to a wide range of pretraining setups. In particular, these models employ a variety of subword tokenization methods, most notably byte-pair encoding (BPE) (Sennrich et al., 2016; Gage, 1994), the WordPiece method (Schuster and Nakajima, 2012), and unigram language modeling (Kudo, 2018), to segment text. However, to the best of our knowledge, the literature does not contain a direct evaluation of the impact of tokenization on language model pretraining. We analyze differences between BPE and unigram LM tokenization, finding that the latter method recovers subword units that align more closely with morphology and avoids problems stemming from BPE’s greedy construction procedure. We then compare the fine-tuned task performance of identical transformer masked language models pretrained with these tokenizations. Across downstream tasks and two languages (English and Japanese), we find that the unigram LM tokenization method matches or outperforms BPE. We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.

Highlights

  • Large transformers (Vaswani et al., 2017) pretrained with variants of a language modeling objective, such as BERT (Devlin et al., 2019), have proven their effectiveness at flexibly transferring to a variety of domains and tasks

  • We directly compare two subword tokenization schemes with publicly available implementations: byte-pair encoding (BPE) and unigram language modeling

  • While the vocabularies resulting from these schemes are heavily overlapping, we compare each method to reference morphological segmentations and find that the unigram language model (LM) method produces tokens better aligned with morphology (an illustrative scoring sketch follows this list)
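
The paper's exact evaluation protocol is not reproduced here; the following is a minimal, illustrative Python sketch of one way to score a tokenizer's segmentation of a word against a reference morphological segmentation, by comparing internal split points (boundary precision, recall, and F1). The example word, the gold segmentation, and the candidate segmentations are hypothetical.

    # Minimal sketch: score a subword segmentation against a reference
    # morphological segmentation by comparing internal split points.
    # The gold and candidate segmentations below are illustrative, not from the paper.

    def boundaries(segments):
        """Character offsets where splits occur, excluding the word edges."""
        cuts, pos = set(), 0
        for seg in segments[:-1]:
            pos += len(seg)
            cuts.add(pos)
        return cuts

    def boundary_f1(candidate, gold):
        """Precision, recall, and F1 of candidate split points against gold split points."""
        cand, ref = boundaries(candidate), boundaries(gold)
        if not cand and not ref:
            return 1.0, 1.0, 1.0
        tp = len(cand & ref)
        p = tp / len(cand) if cand else 0.0
        r = tp / len(ref) if ref else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        return p, r, f

    # Hypothetical segmentations of "unhappiness":
    gold = ["un", "happi", "ness"]          # reference morphological split
    bpe_like = ["unh", "app", "iness"]      # greedy merges can cross morpheme boundaries
    unigram_like = ["un", "happi", "ness"]  # unigram LM tends to recover morpheme-like units

    print(boundary_f1(bpe_like, gold))      # (0.0, 0.0, 0.0)
    print(boundary_f1(unigram_like, gold))  # (1.0, 1.0, 1.0)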

Summary

Introduction

Large transformers (Vaswani et al., 2017) pretrained with variants of a language modeling objective, such as BERT (Devlin et al., 2019), have proven their effectiveness at flexibly transferring to a variety of domains and tasks. While the vocabularies resulting from these tokenization schemes overlap heavily, we compare each method to reference morphological segmentations and find that the unigram LM method produces tokens better aligned with morphology. To understand whether this more natural tokenization leads to improved downstream performance, we pretrain separate language models using the RoBERTa objective (Liu et al., 2019) with each tokenization for both English and Japanese, two typologically distant languages.

Schuster and Nakajima (2012) note that estimating language model parameters for every potential merge is prohibitive, so they employ aggressive heuristics to reduce the number of merges considered. As their implementation is not public, we are unable to compare against this method. The unigram LM method (Kudo, 2018), in contrast to the bottom-up construction process of BPE and WordPiece, begins with a superset of the final vocabulary and prunes it to the desired size.
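
To make the contrast concrete, the following is a minimal sketch, not the paper's actual training pipeline, of building both vocabularies with the SentencePiece library (which provides the unigram LM implementation of Kudo (2018) alongside a BPE mode) and segmenting the same sentence with each. The corpus path corpus.txt, the vocabulary size, and the example sentence are placeholders.

    # Minimal sketch, assuming the sentencepiece Python package is installed
    # (pip install sentencepiece). Paths, vocab size, and the example sentence
    # are placeholders, not the paper's configuration.
    import sentencepiece as spm

    for model_type in ("bpe", "unigram"):
        spm.SentencePieceTrainer.train(
            input="corpus.txt",            # one sentence per line (placeholder path)
            model_prefix=f"tok_{model_type}",
            vocab_size=8000,               # illustrative value
            model_type=model_type,         # "bpe": greedy bottom-up merges;
                                           # "unigram": prune a large seed vocabulary via EM
        )

    bpe = spm.SentencePieceProcessor(model_file="tok_bpe.model")
    uni = spm.SentencePieceProcessor(model_file="tok_unigram.model")

    sentence = "The researchers were pretraining transformers."
    print(bpe.encode(sentence, out_type=str))   # BPE segmentation
    print(uni.encode(sentence, out_type=str))   # unigram LM segmentation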

Algorithms
[Algorithm listing not reproduced here; its final step (line 15) fits the final unigram LM θ to the data D.]
Morphology
Method
Vocabulary Allocation
Downstream Task Experiments
Conclusion