Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora

Hila Gonen, Djamé Seddah, Ganesh Jawahar, Yoav Goldberg

doi:10.18653/v1/2020.acl-main.51

Abstract

The problem of comparing two bodies of text and searching for words that differ in their usage between them arises often in digital humanities and computational social science. This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large. However, these methods often require extensive filtering of the vocabulary to perform well and, as we show in this work, produce unstable, and hence less reliable, results. We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word. The method is simple, interpretable and stable. We demonstrate its effectiveness in 9 different setups, considering different corpus splitting criteria (age, gender and profession of tweet authors, time of tweet) and different languages (English, French and Hebrew).

Highlights

Analyzing differences in corpora from different sources is a central use case in digital humanities and computational social science.
Instead of trying to align two different vector spaces, we propose to work directly in the shared vocabulary space: we take the neighbors of a word in a vector space to reflect its usage, and consider words that have drastically different neighbors in the spaces induced by the different corpora to be words that have undergone usage change.
We compare our proposed method (NN) to the method of Hamilton et al. (2016b) described in Section 4 (AlignCos), in which the vector spaces are first aligned using the Orthogonal Procrustes (OP) algorithm, and words are ranked according to the cosine distance between the word's representations in the two spaces.
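
The neighbor-based idea in the second highlight can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 2-D vectors, the choice of k, and the decision to draw neighbors only from the shared vocabulary are all assumptions made for illustration. Each word is scored by how many of its k nearest neighbors (by cosine similarity) coincide in the two spaces, and words with the smallest overlap are ranked first as usage-change candidates.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_neighbors(word, space, candidates, k):
    """Return the k candidates most similar to `word` within one space."""
    sims = [(cosine(space[word], space[other]), other)
            for other in candidates if other != word]
    # Highest similarity first; ties broken alphabetically for determinism.
    sims.sort(key=lambda pair: (-pair[0], pair[1]))
    return {other for _, other in sims[:k]}


def neighbor_overlap(space_a, space_b, k):
    """Map each shared word to the size of its neighbor-set intersection."""
    # Assumption: neighbors are drawn from the shared vocabulary only.
    shared = sorted(set(space_a) & set(space_b))
    return {
        w: len(top_k_neighbors(w, space_a, shared, k)
               & top_k_neighbors(w, space_b, shared, k))
        for w in shared
    }


def rank_by_usage_change(space_a, space_b, k):
    """Words with the least neighbor overlap come first."""
    overlap = neighbor_overlap(space_a, space_b, k)
    return sorted(overlap, key=lambda w: (overlap[w], w))


# Hypothetical toy embeddings: "bank" sits near money words in corpus A
# and near river words in corpus B.
space_a = {"bank": (1.0, 0.0), "money": (0.9, 0.1), "loan": (0.8, 0.2),
           "river": (0.0, 1.0), "water": (0.1, 0.9)}
space_b = {"bank": (0.0, 1.0), "money": (1.0, 0.0), "loan": (0.9, 0.1),
           "river": (0.1, 0.9), "water": (0.2, 0.8)}

overlap = neighbor_overlap(space_a, space_b, k=2)
ranking = rank_by_usage_change(space_a, space_b, k=2)
print(ranking[0])  # bank
```

Because the score depends only on neighbor sets within each space, no mapping between the two spaces is needed, and the differing neighbors themselves show how the usage differs.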

Summary

Introduction

Analyzing differences in corpora from different sources (different time periods, populations, geographic regions, news outlets, etc.) is a central use case in digital humanities and computational social science. A particular methodology is to identify individual words that are used differently in the different corpora. This includes words whose meaning has changed over time (Kim et al., 2014; Kulkarni et al., 2015; Hamilton et al., 2016b; Kutuzov et al., 2018; Tahmasebi et al., 2018), and words that are used differently by different populations (Azarbonyad et al., 2017; Rudolph et al., 2017). A popular method for performing the task (§4) is to train word embeddings on each corpus and to project one space onto the other using a vector-space alignment algorithm. This approach is sensitive to proper nouns and requires filtering them out.
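
For reference, the alignment-based approach can be written compactly. What follows is the standard Orthogonal Procrustes formulation, given as a sketch and not as a transcription of the paper's own equations. The notation is assumed here: A and B are matrices whose rows a_w and b_w are the embeddings of each shared-vocabulary word w in the two corpora, Q is an orthogonal matrix, and score(w) is a hypothetical name for the cosine distance used for ranking.

```latex
Q^{*} = \arg\min_{Q \,:\, Q^{\top} Q = I} \lVert A Q - B \rVert_{F},
\qquad
Q^{*} = U V^{\top}
\quad \text{where} \quad
U \Sigma V^{\top} = \operatorname{SVD}\!\left(A^{\top} B\right)

\operatorname{score}(w) = 1 - \frac{(a_w Q^{*}) \cdot b_w}{\lVert a_w Q^{*} \rVert \, \lVert b_w \rVert}
```

Words with a large score(w) are those whose aligned representations point in different directions, so the ranking depends on the quality of the learned rotation.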

Publication Date: Jan 1, 2020
Citations: 64
License type: CC BY

Similar Papers

Towards Computational and Behavioral Social Science
Rosaria Conte ... Francesca Giardini
European Psychologist | VOL. 21
01 Apr 2016

Data-Driven Computational Social Network Science: Predictive and Inferential Models for Web-Enabled Scientific Discoveries.
Frank Emmert-Streib ... Matthias Dehmer
Frontiers in Big Data | VOL. 4
22 Apr 2021

Moving Between Scales
Wenyi Shang
Proceedings of the ALISE Annual Conference
29 Sep 2023

GIS and Computational Social Sciences
Xinyue Ye
21 Feb 2023
