Abstract

Though Twitter research has proliferated, no standards for data collection have crystallized. When using keyword queries, the most common data sources (the Search and Streaming APIs) rarely return the full population of tweets, and scholars do not know whether their data constitute a representative sample. This paper seeks to provide the most comprehensive look to date at the potential biases that may result. Employing data derived from four identical keyword queries to the Firehose (which provides the full population of tweets but is cost-prohibitive), Streaming, and Search APIs, we use Kendall's tau and logit regression analyses to understand the differences in the datasets, including what user and content characteristics make a tweet more or less likely to appear in sampled results. We find that there are indeed systematic differences that are likely to bias scholars' findings in almost all datasets we examine, and we recommend significant caution in future Twitter research.
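To make the comparison concrete, the sketch below shows one way such an analysis could be structured: Kendall's tau on hashtag frequency ranks between the full (Firehose) dataset and a sampled (Search or Streaming) dataset, plus a logit model of whether a tweet from the full dataset appears in the sample. This is an illustrative sketch, not the paper's actual code; the column names (tweet_id, hashtag, follower_count, has_url) and feature choices are hypothetical assumptions.

```python
# Illustrative sketch (not the authors' code): comparing a full tweet dataset
# with a sampled one. Column names and predictors are hypothetical.
import pandas as pd
from scipy.stats import kendalltau
import statsmodels.api as sm


def hashtag_rank_agreement(full: pd.DataFrame, sample: pd.DataFrame) -> float:
    """Kendall's tau between hashtag frequency ranks in the two datasets."""
    full_counts = full["hashtag"].value_counts()
    sample_counts = sample["hashtag"].value_counts()
    # Compare only hashtags observed in both datasets.
    common = full_counts.index.intersection(sample_counts.index)
    tau, _p_value = kendalltau(full_counts[common].rank(), sample_counts[common].rank())
    return tau


def inclusion_model(full: pd.DataFrame, sample_ids: set):
    """Logit regression: which tweet/user characteristics predict
    appearing in the sampled dataset?"""
    data = full.copy()
    data["in_sample"] = data["tweet_id"].isin(sample_ids).astype(int)
    predictors = data[["follower_count", "has_url"]].astype(float)
    X = sm.add_constant(predictors)
    return sm.Logit(data["in_sample"], X).fit()
```

A tau close to 1 would indicate that the sample preserves the relative prominence of hashtags in the full population, while significant coefficients in the logit model would flag the kind of systematic inclusion bias the abstract describes.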
