Abstract

Masked language models and autoregressive language models are two types of language models. While pretrained masked language models such as BERT dominate the line of natural language understanding (NLU) tasks, autoregressive language models such as GPT are especially capable at natural language generation (NLG). In this paper, we propose a probabilistic masking scheme for the masked language model, which we call the probabilistically masked language model (PMLM). We implement a specific PMLM with a uniform prior distribution on the masking ratio, named u-PMLM. We prove that u-PMLM is equivalent to an autoregressive permutated language model. One main advantage of the model is that it supports text generation in arbitrary order with surprisingly good quality, which could potentially enable new applications beyond traditional unidirectional generation. In addition, the pretrained u-PMLM outperforms BERT on a set of downstream NLU tasks.
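As a rough illustration of the probabilistic masking scheme described above, the sketch below draws a masking ratio from a uniform prior for each training sequence and masks that fraction of tokens. The token IDs and the MASK_ID constant are hypothetical placeholders, not the paper's implementation.

import random

MASK_ID = 103  # hypothetical [MASK] token id (BERT-style vocabulary); an assumption, not from the paper

def probabilistic_mask(token_ids):
    """Mask a sequence with a masking ratio drawn from Uniform(0, 1).

    A conventional masked LM uses a fixed ratio (e.g. 15%); under a uniform
    prior the model is trained on everything from nearly unmasked to nearly
    fully masked sequences.
    """
    ratio = random.random()                            # masking ratio ~ Uniform(0, 1)
    n_mask = max(1, round(ratio * len(token_ids)))     # how many positions to mask
    positions = random.sample(range(len(token_ids)), n_mask)
    masked = list(token_ids)
    for pos in positions:
        masked[pos] = MASK_ID                          # replace the selected tokens with [MASK]
    return masked, sorted(positions)                   # prediction targets are the original tokens at these positions

# Toy usage with made-up token ids:
tokens = [7592, 1010, 2129, 2024, 2017, 1029]
masked_seq, target_positions = probabilistic_mask(tokens)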

Highlights

  • Large-scale pretrained language models (Raffel et al., 2019; Wang et al., 2019; Lan et al., 2019; Liu et al., 2019; Jiao et al., 2019) have attracted considerable research attention, as these models have brought significant improvements to many natural language understanding (NLU) and natural language generation (NLG) tasks.

  • We prove that u-PMLM, a probabilistically masked language model (PMLM) with a uniform prior on the masking ratio, learns an autoregressive language model over random permutations of the training sequences.

  • We prove that u-PMLM is equivalent to the autoregressive permutated language model (APLM) by recombining the factorized log-likelihood function; the APLM is essentially an autoregressive language model trained on all possible permutations of the training instances (one way to write this objective is sketched below).
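As a hedged sketch of that objective, in our own notation (the paper's exact symbols may differ): for a sequence x of length N and a uniformly random permutation sigma of the positions, the APLM log-likelihood can be written in LaTeX as

\mathcal{L}_{\mathrm{APLM}}(\mathbf{x}) \;=\; \mathbb{E}_{\sigma}\!\left[ \sum_{n=1}^{N} \log p\!\left( x_{\sigma(n)} \,\middle|\, x_{\sigma(1)}, \ldots, x_{\sigma(n-1)} ;\, \theta \right) \right]

That is, each permuted order contributes an ordinary left-to-right factorization, and the model is trained on the expectation over all such orders.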

Summary

Introduction

Large-scale pretrained language models (Raffel et al., 2019; Wang et al., 2019; Lan et al., 2019; Liu et al., 2019; Jiao et al., 2019) have attracted considerable research attention, as these models have brought significant improvements to many NLU and NLG tasks. Unlike a masked language model, which predicts masked tokens given bidirectional context, an autoregressive language model learns a sequential generative process over text sequences and therefore naturally performs better at natural language generation. u-PMLM, in contrast, can generate text in an arbitrary order, with contextual constraints on both sides of each predicted word; this is very challenging for conventional generation models, since fluency and coherence are hard to guarantee under such bidirectional constraints. In addition, u-PMLM outperforms BERT significantly on the GLUE benchmark for natural language understanding.
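A minimal sketch of arbitrary-order generation by iterative mask filling, assuming a BERT-style masked-LM interface. Since a public u-PMLM checkpoint is not assumed to be available, the publicly released bert-base-uncased model stands in here purely to illustrate the decoding loop.

import random
import torch
from transformers import BertForMaskedLM, BertTokenizer

# Stand-in model: bert-base-uncased is used only to illustrate the loop;
# it is not the u-PMLM checkpoint described in the paper.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def generate_arbitrary_order(prompt, num_new_tokens=8):
    """Append [MASK] slots to the prompt and fill them one at a time in a random order."""
    ids = tokenizer(prompt, return_tensors="pt")["input_ids"][0].tolist()
    insert_at = len(ids) - 1                              # keep the final [SEP] in place
    ids[insert_at:insert_at] = [tokenizer.mask_token_id] * num_new_tokens
    slots = list(range(insert_at, insert_at + num_new_tokens))
    random.shuffle(slots)                                 # an arbitrary generation order

    for pos in slots:
        with torch.no_grad():
            logits = model(torch.tensor([ids])).logits    # shape (1, seq_len, vocab_size)
        ids[pos] = int(logits[0, pos].argmax())           # greedily fill the chosen slot
    return tokenizer.decode(ids, skip_special_tokens=True)

print(generate_arbitrary_order("The weather today is"))

In the paper, this kind of loop is driven by u-PMLM itself, which is trained with the uniform masking prior and therefore handles many simultaneous masks far better than vanilla BERT.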

Transformer
Autoregressive Language Model
Masked Language Model
Probabilistically Masked Language Model
Model Formulation
Generation with u-PMLM
Training Settings
Comparative Models
Autoregressive Generation
Natural Language Understanding
Non-traditional Text Generation
Conclusion
A Proof of Equivalence
B Generation Examples of u-PMLM and BERT