Caveat emptor, computational social science: Large-scale missing data in a widely-published Reddit corpus.

Devin Gaffney, J Nathan Matias

doi:10.1371/journal.pone.0200162

Open Access

Journal: PLOS ONE	Publication Date: Jul 6, 2018
Citations: 72	License type: CC BY 4.0

Affiliation: Northeastern University, Princeton University

Abstract

As researchers use computational methods to study complex social behaviors at scale, the validity of this computational social science depends on the integrity of the data. On July 2, 2015, Jason Baumgartner published a dataset advertised to include “every publicly available Reddit comment” which was quickly shared on Bittorrent and the Internet Archive. This data quickly became the basis of many academic papers on topics including machine learning, social behavior, politics, breaking news, and hate speech. We have discovered substantial gaps and limitations in this dataset which may contribute to bias in the findings of that research. In this paper, we document the dataset, substantial missing observations in the dataset, and the risks to research validity from those gaps. In summary, we identify strong risks to research that considers user histories or network analysis, moderate risks to research that compares counts of participation, and lesser risk to machine learning research that avoids making representative claims about behavior and participation on Reddit.

Highlights

A user who deletes even one comment in their posting history introduces many of the problems we describe in this paper, even if the fact of the comment is recorded in the Baumgartner dataset
We have shown ways in which an influential public dataset does not represent the “complete” record that its publisher and users aspired to
We have outlined the risks to research validity represented by these data gaps, including some of our own work

Summary

The Baumgartner Reddit Corpus

Trace data sourced from online platforms has become an essential component for many forms of research ranging from sentiment analysis [1] to epidemiological modeling [2] and economics [3]. Dominant social platforms such as Twitter and Facebook have provided researchers with opportunities to directly study complex phenomena that, at their root, rely strongly on the nature of social interaction [4]. [...] migration through online platforms [7, 11], hate speech [12], and online behavior research methodology [13], among others.

Sequential ID analysis

Diagnosing missing data

The per-user risk of missing data

Distribution of gaps across time

Distribution of gaps across communities

Risk to user history analyses

Risks to network analyses

Risks to research that counts and compares participation between communities

Risks to machine learning models

Findings

Discussion
