Unsupervised discovery of non-trivial similarities between online communities

Abraham Israeli, Shani Cohen, Oren Tsur

doi:10.1016/j.eswa.2022.117900

Abstract

Language is used differently across communities. The differences may be manifested in vocabulary, style, and semantics, and they enable the exploration of nuanced similarities and differences between communities. In this work, we introduce C3, a novel unsupervised approach for community comparison. C3 creates contextual pairwise representations by aligning communities and tuning word embeddings according to both the lexical context and the social context reflected by each community’s structure and engagement patterns. Specifically, C3 takes into account the semantic relations between pairs of words, as reflected by the embedding model of each community, and leverages the social context and users’ roles in their community to calculate a similarity measure between community pairs. C3 is evaluated on a dataset of 1565 active Reddit communities, and its results are compared against three competitive models. We show through an array of experiments and validations that C3 recovers nuanced and non-trivial similarities between communities that are not captured by any of the competitive models. We complement the quantitative results with a qualitative analysis, discussing non-trivial similarities between community pairs such as opiates and adhd, babyBumps and depression, and wallStreetBets and sandersForPresident, all of which are recovered by C3 but not by any of the other models. This qualitative analysis demonstrates the exploratory power of our model.

Full Text