Abstract

In the era of Big Data, the Internet has become one of the main data sources: Data can be collected at relatively low cost and used for a wide range of purposes. To support timely and solid decisions in any field, it is essential to increase data production efficiency, data accuracy, and reliability. In this framework, our paper aims to identify an optimized and flexible method to collect and, at the same time, geolocate social media information over a whole country. In particular, the target of this paper is to compare three alternative methods to collect data from the social media platform Twitter. This is achieved considering four main comparison criteria: Collection time, dataset size, pre-processing phase load, and geographic distribution. Our findings regarding Great Britain identify one of these methods as the best option, since it is able to collect both the highest number of tweets per hour and the highest percentage of unique tweets per hour. Furthermore, this method reduces the computational effort needed to pre-process the collected tweets (e.g., showing the lowest collection times and the lowest number of duplicates within the geographical areas) and enhances the territorial coverage (if compared to the population distribution). At the same time, the effort required to set up this method is feasible and less prone to the arbitrary decisions of the researcher.
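To make the comparison criteria concrete, the sketch below shows one way the paper's headline metrics could be computed for an already-collected tweet set: tweets per hour, the percentage of unique tweets, and the number of duplicates within geographic areas. This is an illustrative example only, not the paper's actual pipeline; the field names (`id`, `area`) and the data layout are assumptions made for this sketch.

```python
from collections import Counter

def collection_metrics(tweets, hours):
    """Summarize a collected tweet set (illustrative, assumed schema).

    tweets: list of dicts with hypothetical keys "id" (tweet ID) and
            "area" (geographic area the tweet was assigned to).
    hours:  duration of the collection run, in hours.
    """
    total = len(tweets)
    unique_ids = {t["id"] for t in tweets}
    # Duplicates within an area: extra copies of the same tweet ID
    # that were collected for the same geographic area.
    per_area = Counter((t["area"], t["id"]) for t in tweets)
    area_duplicates = sum(n - 1 for n in per_area.values() if n > 1)
    return {
        "tweets_per_hour": total / hours,
        "unique_pct": 100.0 * len(unique_ids) / total if total else 0.0,
        "area_duplicates": area_duplicates,
    }

# Example: 4 tweets collected in 2 hours, with one duplicate inside "London".
sample = [
    {"id": 1, "area": "London"},
    {"id": 1, "area": "London"},   # duplicate within the same area
    {"id": 2, "area": "Manchester"},
    {"id": 3, "area": "London"},
]
print(collection_metrics(sample, hours=2))
# → {'tweets_per_hour': 2.0, 'unique_pct': 75.0, 'area_duplicates': 1}
```

A collection method that maximizes `tweets_per_hour` and `unique_pct` while minimizing `area_duplicates` would score best on the criteria the paper describes.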

Highlights

  • We are living in an era characterized by a constant and massive production of a huge amount of data on a daily basis. In this world of “big data”, research linked to data collection methods and data quality issues is becoming more and more important, since these aspects can have a relevant role in many decision-making processes

  • A huge amount of data can be retrieved from social networks [2,3,4] or by means of web-scraping techniques [5]

  • The first main objective of this paper is to identify an optimized and flexible method to collect social media data



Introduction

We are living in an era characterized by a constant and massive production of a huge amount of data on a daily basis (or even at higher frequencies). In this world of “big data”, research linked to data collection methods and data quality issues is becoming more and more important, since these aspects can have a relevant role in many decision-making processes. Data production efficiency and data accuracy and reliability are fundamental premises for taking timely and solid decisions. From this perspective, the Internet represents a very interesting new type of data source and research tool. The data can be made available immediately, producing very large datasets that can be used for a wide range of purposes. They can be employed for supporting theoretical and applied research projects and/or as evidence.
