Twitter data has been widely used by researchers across various social and computer science disciplines. A common aim when working with Twitter data is the construction of a random sample of users from a given country. However, while several methods have been proposed in the literature, their comparative performance is mostly unexplored. In this paper, we implement four common methods to create a random sample of Twitter users in the US: 1% Stream, Bounding Box, Location Query, and Language Query. Then, we compare these methods according to their tweet- and user-level metrics as well as their accuracy in estimating the US population. Our results show that users collected by the 1% Stream method tend to have more tweets, tweets per day, followers, and friends, a fewer number of likes, are younger accounts, and include more male users compared to the other three methods. Moreover, it achieves the minimum error in estimating the US population. However, the 1% Stream method is time-consuming, cannot be used for the past time frames, and is not suitable when user engagement is part of the study. In situation where these three drawbacks are important, our results support the Bounding Box method as the second-best method.
Read full abstract