Abstract

The success of pretrained transformer language models (LMs) in natural language processing has led to a wide range of pretraining setups. In particular, these models employ a variety of subword tokenization methods, most notably byte-pair encoding (BPE) (Sennrich et al., 2016; Gage, 1994), the WordPiece method (Schuster and Nakajima, 2012), and unigram language modeling (Kudo, 2018), to segment text. However, to the best of our knowledge, the literature does not contain a direct evaluation of the impact of tokenization on language model pretraining. We analyze differences between BPE and unigram LM tokenization, finding that the latter method recovers subword units that align more closely with morphology and avoids problems stemming from BPE’s greedy construction procedure. We then compare the fine-tuned task performance of identical transformer masked language models pretrained with these tokenizations. Across downstream tasks and two languages (English and Japanese), we find that the unigram LM tokenization method matches or outperforms BPE. We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.

Highlights

  • Large transformers (Vaswani et al., 2017) pretrained with variants of a language modeling objective, such as BERT (Devlin et al., 2019), have proven their effectiveness at flexibly transferring to a variety of domains and tasks

  • We directly compare two subword tokenization schemes with publicly available implementations: byte-pair encoding (BPE) and unigram language modeling

  • While the vocabularies resulting from these schemes are heavily overlapping, we compare each method to reference morphological segmentations and find that the unigram language model (LM) method produces tokens better aligned with morphology (an illustrative scoring sketch follows this list)
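
The paper's exact evaluation protocol is not reproduced here; the following is a minimal, illustrative Python sketch of one way to score a tokenizer's segmentation of a word against a reference morphological segmentation, by comparing internal split points (boundary precision, recall, and F1). The example word, the gold segmentation, and the candidate segmentations are hypothetical.

    # Minimal sketch: score a subword segmentation against a reference
    # morphological segmentation by comparing internal split points.
    # The gold and candidate segmentations below are illustrative, not from the paper.

    def boundaries(segments):
        """Character offsets where splits occur, excluding the word edges."""
        cuts, pos = set(), 0
        for seg in segments[:-1]:
            pos += len(seg)
            cuts.add(pos)
        return cuts

    def boundary_f1(candidate, gold):
        """Precision, recall, and F1 of candidate split points against gold split points."""
        cand, ref = boundaries(candidate), boundaries(gold)
        if not cand and not ref:
            return 1.0, 1.0, 1.0
        tp = len(cand & ref)
        p = tp / len(cand) if cand else 0.0
        r = tp / len(ref) if ref else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        return p, r, f

    # Hypothetical segmentations of "unhappiness":
    gold = ["un", "happi", "ness"]          # reference morphological split
    bpe_like = ["unh", "app", "iness"]      # greedy merges can cross morpheme boundaries
    unigram_like = ["un", "happi", "ness"]  # unigram LM tends to recover morpheme-like units

    print(boundary_f1(bpe_like, gold))      # (0.0, 0.0, 0.0)
    print(boundary_f1(unigram_like, gold))  # (1.0, 1.0, 1.0)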

Summary

Introduction

Large transformers (Vaswani et al., 2017) pretrained with variants of a language modeling objective, such as BERT (Devlin et al., 2019), have proven their effectiveness at flexibly transferring to a variety of domains and tasks. While the vocabularies resulting from these tokenization schemes overlap heavily, we compare each method to reference morphological segmentations and find that the unigram LM method produces tokens better aligned with morphology. To understand whether this more natural tokenization leads to improved downstream performance, we pretrain separate language models using the RoBERTa objective (Liu et al., 2019) with each tokenization for both English and Japanese, two typologically distant languages.

Schuster and Nakajima (2012) note that estimating language model parameters for every potential merge is prohibitive, so they employ aggressive heuristics to reduce the number of merges considered. As their implementation is not public, we are unable to compare against this method. The unigram LM method (Kudo, 2018), in contrast to the bottom-up construction process of BPE and WordPiece, begins with a superset of the final vocabulary and prunes it to the desired size.
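
To make the contrast concrete, the following is a minimal sketch, not the paper's actual training pipeline, of building both vocabularies with the SentencePiece library (which provides the unigram LM implementation of Kudo (2018) alongside a BPE mode) and segmenting the same sentence with each. The corpus path corpus.txt, the vocabulary size, and the example sentence are placeholders.

    # Minimal sketch, assuming the sentencepiece Python package is installed
    # (pip install sentencepiece). Paths, vocab size, and the example sentence
    # are placeholders, not the paper's configuration.
    import sentencepiece as spm

    for model_type in ("bpe", "unigram"):
        spm.SentencePieceTrainer.train(
            input="corpus.txt",            # one sentence per line (placeholder path)
            model_prefix=f"tok_{model_type}",
            vocab_size=8000,               # illustrative value
            model_type=model_type,         # "bpe": greedy bottom-up merges;
                                           # "unigram": prune a large seed vocabulary via EM
        )

    bpe = spm.SentencePieceProcessor(model_file="tok_bpe.model")
    uni = spm.SentencePieceProcessor(model_file="tok_unigram.model")

    sentence = "The researchers were pretraining transformers."
    print(bpe.encode(sentence, out_type=str))   # BPE segmentation
    print(uni.encode(sentence, out_type=str))   # unigram LM segmentation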

Algorithms
[Algorithm listing not reproduced here; its final step (line 15) fits the final unigram LM θ to the data D.]
Morphology
Method
Vocabulary Allocation
Downstream Task Experiments
Conclusion