Abstract

Using a random walk model of text generation, Arora et al. (2017) proposed a strong baseline for computing sentence embeddings: take a weighted average of word embeddings and modify with SVD. This simple method even outperforms far more complex approaches such as LSTMs on textual similarity tasks. In this paper, we first show that word vector length has a confounding effect on the probability of a sentence being generated in Arora et al.’s model. We propose a random walk model that is robust to this confound, where the probability of word generation is inversely related to the angular distance between the word and sentence embeddings. Our approach beats Arora et al.’s by up to 44.4% on textual similarity tasks and is competitive with state-of-the-art methods. Unlike Arora et al.’s method, ours requires no hyperparameter tuning, which means it can be used when there is no labelled data.
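
For concreteness, the angular distance between a word embedding v_w and the sentence embedding c_s is the standard normalized angle (our notation, not a verbatim formula from the paper):

$$ d_{\angle}(v_w, c_s) = \frac{1}{\pi} \arccos\!\left( \frac{\langle v_w, c_s \rangle}{\lVert v_w \rVert \, \lVert c_s \rVert} \right) $$

Because the inner product is normalized by both vector lengths, this distance depends only on direction, which is what makes a model based on it robust to the length confound.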

Highlights

  • Distributed representations of words, better known as word embeddings, have become fixtures of current methods in natural language processing

  • We first show that word vector length has a confounding effect on the log-linear random walk model of text generation (Arora et al., 2017), the basis of a strong baseline method for sentence embeddings

  • We propose an angular distance–based random walk model where the probability of a sentence being generated is robust to distortion from word vector length (see the sketch after this list)
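
To make the confound concrete, the following sketch (our illustration, not code from the paper) rescales a word vector and compares the dot product used in the log-linear model, which grows with vector length, against angular distance, which depends only on direction. All variable names are illustrative.

    import numpy as np

    def angular_distance(u, v):
        # Normalized angle in [0, 1]; invariant to the lengths of u and v.
        cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.arccos(np.clip(cos_sim, -1.0, 1.0)) / np.pi

    rng = np.random.default_rng(0)
    c_s = rng.normal(size=300)   # sentence (discourse) vector
    v_w = rng.normal(size=300)   # word vector

    for scale in (0.5, 1.0, 2.0):  # rescale the word vector only
        v = scale * v_w
        # In the log-linear model, the generation probability rises with <c_s, v>,
        # so merely lengthening the word vector inflates it; the angle does not move.
        print(f"scale={scale}: dot={np.dot(c_s, v):.2f}, "
              f"angular={angular_distance(c_s, v):.3f}")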

Summary

Introduction

Distributed representations of words, better known as word embeddings, have become fixtures of current methods in natural language processing. A simple way to represent a sentence is to average its word embeddings; Arora et al. (2017) provided a more powerful approach: compute the sentence embedding as a weighted average of the word embeddings, then subtract from each sentence embedding its vector projection on the first principal component. In their random walk model, a word unrelated to the sentence's discourse vector c_s can still be generated, either by chance or because it is part of frequent discourse, such as stopwords. This approach even outperforms more complex models such as LSTMs on textual similarity tasks. Arora et al. argued that the simplicity and effectiveness of their method make it a tough-to-beat baseline for sentence embeddings. Though they call their approach unsupervised, others have noted that it is ‘weakly supervised’, since it requires hyperparameter tuning (Cer et al., 2017). Given the simplicity, effectiveness, and unsupervised nature of our method, we suggest it be used as a baseline for computing sentence embeddings.
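
As a point of reference, here is a minimal sketch of Arora et al.'s baseline under stated assumptions: pretrained word vectors and unigram probabilities p(w) are given as dictionaries, the weight a / (a + p(w)) and the removal of the projection on the first principal component follow Arora et al. (2017), and the function name and the default a = 1e-3 are illustrative.

    import numpy as np

    def sif_embeddings(sentences, word_vecs, word_prob, a=1e-3):
        # sentences: list of tokenized sentences (lists of words)
        # word_vecs: dict mapping word -> np.ndarray embedding
        # word_prob: dict mapping word -> unigram probability p(w)
        embs = []
        for sent in sentences:
            words = [w for w in sent if w in word_vecs]
            # Weighted average: frequent words are down-weighted by a / (a + p(w)).
            weights = np.array([a / (a + word_prob.get(w, 0.0)) for w in words])
            vecs = np.array([word_vecs[w] for w in words])
            embs.append(weights @ vecs / len(words))
        embs = np.array(embs)
        # Remove the common component: subtract each embedding's projection
        # on the first right singular vector (first principal component).
        u = np.linalg.svd(embs, full_matrices=False)[2][0]
        return embs - np.outer(embs @ u, u)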

Related Work
The Log-Linear Random Walk Model
The Confounding Effect of Vector Length
An Angular Distance–Based Random Walk Model
Textual Similarity Tasks
Experimental Settings
Results
Supervised Tasks
Future Work
Conclusion
