Abstract

Pretrained masked language models (MLMs) require finetuning for most NLP tasks. Instead, we evaluate MLMs out of the box via their pseudo-log-likelihood scores (PLLs), which are computed by masking tokens one by one. We show that PLLs outperform scores from autoregressive language models like GPT-2 in a variety of tasks. By rescoring ASR and NMT hypotheses, RoBERTa reduces an end-to-end LibriSpeech model's WER by 30% relative and adds up to +1.7 BLEU on state-of-the-art baselines for low-resource translation pairs, with further gains from domain adaptation. We attribute this success to PLL's unsupervised expression of linguistic acceptability without a left-to-right bias, greatly improving on scores from GPT-2 (+10 points on island effects, NPI licensing in BLiMP). One can finetune MLMs to give scores without masking, enabling computation in a single inference pass. In all, PLLs and their associated pseudo-perplexities (PPPLs) enable plug-and-play use of the growing number of pretrained MLMs; e.g., we use a single cross-lingual model to rescore translations in multiple languages. We release our library for language model scoring at https://github.com/awslabs/mlm-scoring.
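
For concreteness, here is a minimal sketch of how a PLL is computed by masking tokens one by one, written against the HuggingFace Transformers API rather than the paper's released mlm-scoring library; the model choice and the handling of special tokens are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch (not the authors' mlm-scoring library): pseudo-log-likelihood
# (PLL) of a sentence under a masked LM, masking one token at a time and
# summing the log-probability of each original token at its masked position.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    pll = 0.0
    # Skip the BOS/EOS special tokens; mask each real token in turn.
    for i in range(1, input_ids.size(0) - 1):
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        pll += log_probs[input_ids[i]].item()
    return pll

print(pseudo_log_likelihood("Hello world."))
```

Note that this requires one forward pass per token; the finetuning-without-masking variant mentioned in the abstract is what reduces scoring to a single inference pass.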

Highlights

  • BERT (Devlin et al., 2019) and its improvements to natural language understanding have spurred a rapid succession of contextual language representations (Yang et al., 2019; Liu et al., 2019; inter alia) which use larger datasets and more involved training schemes.

  • In Appendix C, we plot sentence-level pseudo-log-likelihood scores (PLLs) versus |W| and observe linearity as |W| → ∞, with spikes from the last word and lowercase first word smoothing out. This behavior motivates our choice of α = 1.0 when applying the Google neural machine translation (NMT)-style length penalty (Wu et al., 2016) to PLLs, which corresponds to the asymptotically linear LP_MLM(W) = (5 + |W|)/(5 + 1); see the sketch after this list.

  • We studied scoring with masked language model (MLM) pseudo-log-likelihood scores in a variety of settings.
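
The length penalty referenced in the second highlight is the standard GNMT formula. Below is a small sketch of how it would be applied when normalizing a PLL for comparing hypotheses of different lengths; the function names and the normalize-by-division convention are illustrative assumptions.

```python
# Sketch: GNMT-style length penalty (Wu et al., 2016) applied to a PLL.
# With alpha = 1.0 the penalty is asymptotically linear in the token count |W|,
# matching the roughly linear growth of PLL with sentence length noted above.
def length_penalty(num_tokens: int, alpha: float = 1.0) -> float:
    return ((5 + num_tokens) / (5 + 1)) ** alpha

def length_normalized_pll(pll: float, num_tokens: int, alpha: float = 1.0) -> float:
    return pll / length_penalty(num_tokens, alpha)
```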

Summary

Introduction

BERT (Devlin et al., 2019) and its improvements to natural language understanding have spurred a rapid succession of contextual language representations (Yang et al., 2019; Liu et al., 2019; inter alia) which use larger datasets and more involved training schemes. Their success is attributed to their use of bidirectional context, often via their masked language model (MLM) objectives. We score sentences with their pseudo-log-likelihood (PLL), given by summing the conditional log probabilities log P_MLM(w_t | W\t) of each sentence token (Shin et al., 2019). These are induced in BERT by replacing w_t with [MASK] (Figure 1). We use PLLs to perform unsupervised acceptability judgments on the BLiMP minimal pairs set (Warstadt et al., 2020); BERT and RoBERTa models improve the state of the art (GPT-2 probabilities) by up to 3.9% absolute, with +10% on island effects and NPI licensing phenomena. PLLs can thus be used to assess the linguistic competence of MLMs in a supervision-free manner.
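
As an illustration of these unsupervised acceptability judgments, a short sketch that reuses the pseudo_log_likelihood function from the snippet after the abstract; the minimal pair shown here is invented for illustration and is not drawn from BLiMP.

```python
# Sketch: BLiMP-style minimal-pair judgment. The MLM is credited when the
# acceptable sentence receives the higher pseudo-log-likelihood.
def prefers_acceptable(good_sentence: str, bad_sentence: str) -> bool:
    return pseudo_log_likelihood(good_sentence) > pseudo_log_likelihood(bad_sentence)

print(prefers_acceptable("The cats sleep on the sofa.",
                         "The cats sleeps on the sofa."))
```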

Pseudolikelihood estimation
Sequence-to-sequence rescoring
Pseudo-perplexity
The log-linear model
Experimental setup
Domain adaptation
Finetuning without masking
Analysis
Relative linguistic acceptability
Interpolation with direct models
Numerical properties of PLL
Related work
Conclusion
Language models
Automatic speech recognition
Neural machine translation
B BERT as a generative model
C Pseudo-perplexity and rescoring
D Combining MLMs and GPT-2