Abstract

Subwords have become very popular, but the BERT and ERNIE tokenizers often produce surprising results. Byte pair encoding (BPE) trains a dictionary with a simple information-theoretic criterion that sidesteps the need for special treatment of unknown words. BPE is more about training (populating a dictionary of word pieces) than inference (parsing an unknown word into word pieces). The parse at inference time can be ambiguous. Which parse should we use? For example, “electroneutral” can be parsed as electron-eu-tral or electro-neutral, and “bidirectional” can be parsed as bid-ire-ction-al or bi-directional. BERT and ERNIE tend to favor the parse with more word pieces. We propose minimizing the number of word pieces. To justify our proposal, a number of criteria will be considered: sound, meaning, etc. For example, the prefix bi- has the desired vowel (unlike bid) and the desired meaning (bi is Latin for two, unlike bid, which is Germanic for offer).
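The ambiguity described above can be illustrated with a minimal sketch. It contrasts a left-to-right longest-match-first parse with a parse that minimizes the number of word pieces, found by dynamic programming. The vocabulary `TOY_VOCAB` and the function names are hypothetical and chosen only to reproduce the two examples from the abstract; real tokenizers use far larger vocabularies and continuation markers, which are omitted here.

```python
def greedy_parse(word, vocab):
    """Left-to-right longest-match-first segmentation; None if it gets stuck."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return None  # no vocabulary entry starts at this position
        pieces.append(word[start:end])
        start = end
    return pieces


def min_pieces_parse(word, vocab):
    """Segmentation with the fewest pieces, found by dynamic programming."""
    # best[i] holds a fewest-piece parse of word[:i], or None if none exists.
    best = [None] * (len(word) + 1)
    best[0] = []
    for end in range(1, len(word) + 1):
        for start in range(end):
            if best[start] is not None and word[start:end] in vocab:
                candidate = best[start] + [word[start:end]]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(word)]


# Hypothetical toy vocabulary, just large enough for the two examples.
TOY_VOCAB = {
    "electron", "electro", "neutral", "eu", "tral",
    "bid", "bi", "ire", "ction", "al", "directional",
}

for w in ("electroneutral", "bidirectional"):
    print(w, "-".join(greedy_parse(w, TOY_VOCAB)), "-".join(min_pieces_parse(w, TOY_VOCAB)))
```

With this toy vocabulary, the greedy parse commits to the longest prefix (electron, bid) and is then forced into short leftover pieces, while the minimum-piece parse recovers electro-neutral and bi-directional.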

Highlights

  • Subwords/word pieces have become quite popular recently, especially for deep nets. They are used in the front end of BERT (Devlin et al. 2018) and ERNIE (Sun et al. 2019), two very successful deep nets for language applications.

  • BERT provides the following motivation for word pieces: “Using wordpieces gives a good balance between the flexibility of single characters and the efficiency of full words for decoding, and sidesteps the need for special treatment of unknown words.” (Devlin et al. 2018)

  • Subwords are based on byte pair encoding (BPE) (Sennrich, Haddow, and Birch 2016), which borrows ideas from information theory to learn a dictionary of word pieces from a training corpus.
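As a rough illustration of how BPE populates a dictionary, the sketch below repeatedly merges the most frequent adjacent pair of symbols in a word-frequency table. It is a simplified sketch in the spirit of BPE, not the implementation used by any particular tokenizer: the function name `learn_bpe` and the toy corpus are hypothetical, and details such as end-of-word markers are left out.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Return the list of merges learned from a {word: frequency} table."""
    # Each word starts out as a tuple of single-character symbols.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite each word with the chosen pair fused into one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


# Hypothetical toy corpus with made-up frequencies.
toy_corpus = {"direction": 5, "directional": 3, "bidirectional": 2}
print(learn_bpe(toy_corpus, 10))
```

The learned merges define the dictionary of word pieces. Nothing in this training loop says which parse to prefer when an unseen word can be segmented in more than one way, which is the inference-time question the abstract raises.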


Summary

Table: word, word pieces, and a better alternative (PubMed frequency values not recoverable)

Word               Word pieces            Better alternative
direction          direction
directional        directional
unidirectional     un-idi-re-ction-al     uni-directional
bidirectional      bid-ire-ction-al       bi-directional
bidimensional      bid-ime-ns-ional       bi-dimensional
electroneutral     electron-eu-tral       electro-neutral
neurotransmitter   ne-uro-tra-ns-mit-ter  neuro-transmitter
potassium          potassium
dipotassium        dip-ota-ssi-um         di-potassium
bipotassium        bi-pot-ass-ium         bi-potassium
monopotassium      mono-pot-ass-ium       mono-potassium
hexapotassium      he-xa-pot-ass-ium      hexa-potassium
schizophrenic      sc-hi-zo-ph-ren-ic     schizophrenia − a + ic
schizophrenia      schizophrenia
