Abstract

As researchers use computational methods to study complex social behaviors at scale, the validity of this computational social science depends on the integrity of the data. On July 2, 2015, Jason Baumgartner published a dataset advertised to include “every publicly available Reddit comment,” which was quickly shared on BitTorrent and the Internet Archive. The data soon became the basis of many academic papers on topics including machine learning, social behavior, politics, breaking news, and hate speech. We have discovered substantial gaps and limitations in this dataset that may contribute to bias in the findings of that research. In this paper, we document the dataset, the substantial missing observations in it, and the risks to research validity from those gaps. In summary, we identify strong risks to research that considers user histories or network analysis, moderate risks to research that compares counts of participation, and lesser risk to machine learning research that avoids making representative claims about behavior and participation on Reddit.

Highlights

  • A user who deletes even one comment in their posting history introduces many of the problems we describe in this paper, even if the fact of the comment is recorded in the Baumgartner dataset

  • We have shown ways in which an influential public dataset does not represent the “complete” record that its publisher and users aspired to

  • We have outlined the risks that these data gaps pose to research validity, including the validity of some of our own work


Summary

The Baumgartner Reddit Corpus

Trace data sourced from online platforms has become an essential component for many forms of research ranging from sentiment analysis [1] to epidemiological modeling [2] and economics [3]. Dominant social platforms such as Twitter and Facebook have provided researchers with opportunities to directly study complex phenomena that, at their root, rely strongly on the nature of social interaction [4].

Computational social science: Large-scale missing data in a widely-published Reddit corpus

[…] migration through online platforms [7, 11], hate speech [12], and online behavior research methodology [13], among others.

Sequential ID analysis
Diagnosing missing data
The per-user risk of missing data
Distribution of gaps across time
Distribution of gaps across communities
Risk to user history analyses
Risks to network analyses
Risks to research that counts and compares participation between communities
Risks to machine learning models
Findings
Discussion
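
The outline above lists a "Sequential ID analysis" step, but this summary does not describe how it was carried out. As a rough illustration only, and not the authors' procedure, the sketch below assumes that comment IDs are base-36 strings assigned in increasing order, and lists the IDs that never appear between the smallest and largest ID observed. The function names `find_missing_ids` and `to_base36` and the sample IDs are hypothetical.

```python
BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n):
    """Encode a non-negative integer as a lowercase base-36 string."""
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 36)
        out.append(BASE36_DIGITS[remainder])
    return "".join(reversed(out))


def find_missing_ids(observed_ids):
    """Return IDs absent between the smallest and largest observed ID.

    Assumption (not from the source): IDs are base-36 strings that
    increase by one per comment, so an unseen value inside the observed
    range is a candidate gap.
    """
    seen = {int(s, 36) for s in observed_ids}
    if not seen:
        return []
    lowest, highest = min(seen), max(seen)
    return [to_base36(n) for n in range(lowest, highest + 1) if n not in seen]


# Made-up sample IDs, for illustration only.
sample = ["c0a", "c0c", "c0d", "c0f"]
print(find_missing_ids(sample))
```

An ID that is absent from the range is only a candidate gap: under these assumptions the check cannot tell whether the comment was never public or was simply omitted from the dataset.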
