On the Sentence Embeddings from Pre-trained Language Models

Bohan Li, Lei Li, Yiming Yang, Junxian He, Mingxuan Wang, Hao Zhou

doi:10.18653/v1/2020.emnlp-main.733

Abstract

Pre-trained contextual representations like BERT have achieved great success in natural language processing. However, the sentence embeddings from pre-trained language models without fine-tuning have been found to capture the semantic meaning of sentences poorly. In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited. We first reveal the theoretical connection between the masked language model pre-training objective and the semantic similarity task, and then analyze the BERT sentence embeddings empirically. We find that BERT always induces a non-smooth anisotropic semantic space of sentences, which harms its performance on semantic similarity. To address this issue, we propose to transform the anisotropic sentence embedding distribution to a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective. Experimental results show that our proposed BERT-flow method obtains significant performance gains over the state-of-the-art sentence embeddings on a variety of semantic textual similarity tasks. The code is available at https://github.com/bohanli/BERT-flow.

Highlights

Pre-trained language models and their variants (Radford et al, 2019; Devlin et al, 2019; Yang et al, 2019; Liu et al, 2019) like BERT (Devlin et al, 2019) have been widely used as representations of natural language.
Inspired by Gao et al (2019), who find that language modeling performance can be limited by a learned anisotropic word embedding space in which the word embeddings occupy a narrow cone, and by Ethayarajh (2019), who finds that BERT word embeddings suffer from anisotropy, we hypothesize that the sentence embeddings from BERT, computed as an average of context embeddings from the last layers, may suffer from similar issues.
In addition to the semantic textual similarity tasks, we examine the effectiveness of our method on unsupervised question-answer entailment.
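The anisotropy hypothesis in the highlights above can be illustrated with a small sketch. It is not the authors' code and does not use real BERT output: the token embeddings are synthetic Gaussian vectors, and the function names (`mean_pool`, `average_cosine_similarity`), the shared `offset`, and all sizes are assumptions chosen for illustration. Sentence vectors are formed by averaging token embeddings, and the average cosine similarity between distinct sentence pairs serves as a rough indicator of how strongly the vectors crowd into a narrow cone.

```python
import numpy as np


def mean_pool(token_embeddings):
    """Average token (context) embeddings into a single sentence vector."""
    return token_embeddings.mean(axis=0)


def average_cosine_similarity(sentence_embeddings):
    """Mean cosine similarity over all distinct pairs of sentence vectors."""
    norms = np.linalg.norm(sentence_embeddings, axis=1, keepdims=True)
    unit = sentence_embeddings / norms
    sims = unit @ unit.T
    n = len(unit)
    # Drop the diagonal, which holds each vector's similarity with itself.
    off_diagonal = sims[~np.eye(n, dtype=bool)]
    return float(off_diagonal.mean())


rng = np.random.default_rng(0)
num_sentences, num_tokens, dim = 200, 12, 64

# Synthetic stand-in for last-layer token embeddings (an assumption,
# not real BERT activations).
tokens = rng.normal(size=(num_sentences, num_tokens, dim))

# Case 1: zero-mean token embeddings, so directions are spread evenly.
isotropic = np.stack([mean_pool(t) for t in tokens])

# Case 2: the same tokens plus one shared offset, which pushes every
# sentence vector toward a common direction (a "narrow cone").
offset = 3.0 * np.ones(dim)
anisotropic = np.stack([mean_pool(t + offset) for t in tokens])

iso_score = average_cosine_similarity(isotropic)
aniso_score = average_cosine_similarity(anisotropic)

print(f"average cosine similarity, isotropic:   {iso_score:.3f}")
print(f"average cosine similarity, anisotropic: {aniso_score:.3f}")
```

In the second case almost every pair of sentences looks highly similar, regardless of content, which is one way to see why cosine similarity becomes a weak signal in an anisotropic space.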

Summary

Introduction

Pre-trained language models and their variants (Radford et al, 2019; Devlin et al, 2019; Yang et al, 2019; Liu et al, 2019) like BERT (Devlin et al, 2019) have been widely used as representations of natural language. Despite their great success on many NLP tasks through fine-tuning, the sentence embeddings from BERT without fine-tuning are significantly inferior in terms of semantic textual similarity (Reimers and Gurevych, 2019); for example, they even underperform the GloVe (Pennington et al, 2014) embeddings, which are not contextualized and are trained with a much simpler model. Through empirical probing over the embeddings, we further observe that the BERT sentence embedding space is semantically non-smooth and poorly defined in some areas, which makes it hard to use directly through simple similarity metrics such as dot product or cosine similarity.

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 9119–9130, November 16–20, 2020. © 2020 Association for Computational Linguistics.

Objectives

Methods

Results