Abstract
As researchers use computational methods to study complex social behaviors at scale, the validity of this computational social science depends on the integrity of the data. On July 2, 2015, Jason Baumgartner published a dataset advertised to include “every publicly available Reddit comment” which was quickly shared on Bittorrent and the Internet Archive. This data quickly became the basis of many academic papers on topics including machine learning, social behavior, politics, breaking news, and hate speech. We have discovered substantial gaps and limitations in this dataset which may contribute to bias in the findings of that research. In this paper, we document the dataset, substantial missing observations in the dataset, and the risks to research validity from those gaps. In summary, we identify strong risks to research that considers user histories or network analysis, moderate risks to research that compares counts of participation, and lesser risk to machine learning research that avoids making representative claims about behavior and participation on Reddit.
Highlights
A user who deletes even one comment in their posting history introduces many of the problems we describe in this paper, even if the fact of the comment is recorded in the Baumgartner dataset
We have shown ways in which an influential public dataset does not represent the “complete” record that its publisher and users aspired to
We have outlined the risks to research validity represented by these data gaps, including some of our own work
Summary
Trace data sourced from online platforms has become an essential component for many forms of research ranging from sentiment analysis [1] to epidemiological modeling [2] and economics [3]. Dominant social platforms such as Twitter and Facebook have provided researchers with opportunities to directly study complex phenomena that, at their root, rely strongly on the nature of social interaction [4]. Computational social science: Large-scale missing data in a widely-published Reddit corpus migration through online platforms [7, 11], hate speech [12], and online behavior research methodology [13], among others.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.