Abstract

The problem of comparing two bodies of text and searching for words that differ in their usage between them arises often in digital humanities and computational social science. This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large. However, these methods often require extensive filtering of the vocabulary to perform well and, as we show in this work, produce unstable and hence less reliable results. We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word. The method is simple, interpretable and stable. We demonstrate its effectiveness in 9 different setups, considering different corpus splitting criteria (age, gender and profession of tweet authors, time of tweet) and different languages (English, French and Hebrew).

Highlights

  • Analyzing differences in corpora from different sources is a central use case in digital humanities and computational social science

  • Instead of trying to align two different vector spaces, we propose to work directly in the shared vocabulary space: we take the neighbors of a word in a vector space to reflect its usage, and consider words that have drastically different neighbors in the spaces induced by the different corpora to be words that have undergone usage change

  • We compare our proposed method (NN) to the method of Hamilton et al (2016b) described in Section 4 (AlignCos), in which the vector spaces are first aligned using the Orthogonal Procrustes (OP) algorithm, and words are then ranked according to the cosine distance between each word's representations in the two spaces
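
The neighbor-based idea in the highlights can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usage-change score of a word is the size of the overlap between its top-k cosine neighbors in the two spaces (a smaller overlap suggesting a larger change). The function names, the value of k and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_neighbors(word, space, k):
    """Set of the k words most similar to `word` within one embedding space."""
    target = space[word]
    others = [w for w in space if w != word]
    others.sort(key=lambda w: cosine(target, space[w]), reverse=True)
    return set(others[:k])


def neighbor_overlap(word, space_a, space_b, k):
    """Number of top-k neighbors that `word` keeps across the two spaces.

    Only the shared vocabulary is considered, so no alignment is needed.
    """
    shared = set(space_a) & set(space_b)
    a = {w: space_a[w] for w in shared}
    b = {w: space_b[w] for w in shared}
    return len(top_k_neighbors(word, a, k) & top_k_neighbors(word, b, k))


def rank_by_usage_change(space_a, space_b, k):
    """Shared words, ordered from smallest to largest neighbor overlap."""
    shared = sorted(set(space_a) & set(space_b))
    return sorted(shared, key=lambda w: neighbor_overlap(w, space_a, space_b, k))


# Hypothetical toy embeddings: "apple" sits near fruit words in corpus A
# and near device words in corpus B; the other words barely move.
space_a = {"apple": (1.0, 0.1), "pear": (1.0, 0.0), "fruit": (0.9, 0.2),
           "phone": (0.0, 1.0), "laptop": (0.1, 1.0)}
space_b = {"apple": (0.1, 1.0), "pear": (1.0, 0.0), "fruit": (0.9, 0.2),
           "phone": (0.0, 1.0), "laptop": (0.1, 0.9)}

ranking = rank_by_usage_change(space_a, space_b, k=2)
print(ranking[0])
```

Because the comparison is made between sets of words rather than between vectors, the two spaces never have to share a coordinate system, and the surviving or lost neighbors can be read directly as an explanation of the score.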

Introduction

Analyzing differences in corpora from different sources (different time periods, populations, geographic regions, news outlets, etc.) is a central use case in digital humanities and computational social science. One particular methodology is to identify individual words that are used differently in the different corpora. This includes words whose meaning has changed over time (Kim et al, 2014; Kulkarni et al, 2015; Hamilton et al, 2016b; Kutuzov et al, 2018; Tahmasebi et al, 2018), and words that are used differently by different populations (Azarbonyad et al, 2017; Rudolph et al, 2017). A popular method for performing the task (§4) is to train word embeddings on each corpus and to project one space onto the other using a vector space alignment algorithm. This approach is sensitive to proper nouns and requires filtering them.
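
The alignment-based approach can be sketched as follows, assuming the alignment step is Orthogonal Procrustes and the score is the cosine distance after alignment. This is an illustrative sketch rather than the original implementation: it uses NumPy, assumes row i of both matrices holds the embedding of the same word, and the function names and synthetic data are hypothetical.

```python
import numpy as np


def orthogonal_procrustes(source, target):
    """Orthogonal matrix R minimising the Frobenius norm of source @ R - target."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def align_cos_scores(source, target):
    """Cosine distance between each aligned source row and its target row."""
    rotation = orthogonal_procrustes(source, target)
    aligned = source @ rotation
    aligned = aligned / np.linalg.norm(aligned, axis=1, keepdims=True)
    unit_target = target / np.linalg.norm(target, axis=1, keepdims=True)
    return 1.0 - np.sum(aligned * unit_target, axis=1)


# Synthetic data (an assumption for illustration): the second space is a
# rotated copy of the first, except for row 7, whose vector is flipped.
rng = np.random.default_rng(0)
source = rng.normal(size=(200, 5))
true_rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))
target = source @ true_rotation
target[7] = -target[7]

scores = align_cos_scores(source, target)
print(int(np.argmax(scores)))  # index of the row with the largest distance
```

Words would then be ranked by descending score. The rotation is fitted on every row, including the rows whose usage changed, so the ranking depends on which words enter the fit.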
