Abstract

In this paper, we propose DeepScraper, a scraping method for collecting tweets. DeepScraper quickly and completely scrapes all tweets written by a given group of users or containing given search keywords. To improve its crawling speed, we devise a multiprocessing architecture in which each process is authenticated by simulating the behavior of a user accessing Twitter. This allows us to maximize the parallelism of crawling even on a single machine. Through extensive experiments, we show that DeepScraper can crawl all tweets of 99 users, which amounts to 5,798,052 tweets, whereas the Twitter standard API can crawl only 243,650 of them because it limits the number of tweets that can be retrieved. In other words, DeepScraper collected 23.7 times more tweets for the 99 users than the standard API. We also show the efficiency of DeepScraper. First, we show the effect of authenticated multiprocessing: compared to DeepScraper with a single process, the crawling speed increases by 2.03 to 10.57 times as the number of running processes increases from 2 to 32. Then, we compare the crawling speed of DeepScraper with that of existing methods. The results show that DeepScraper is comparable in speed to the Twitter standard API and Twitter4J, while scraping far more tweets than either. Furthermore, DeepScraper is roughly 3.69 times faster than Twitter Scrapy, although both can scrape all tweets for the target users or keywords.
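The multiprocessing idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`partition`, `open_session`, `fetch_tweets`, `crawl_chunk`, `crawl_in_parallel`) are hypothetical, the per-process authentication is reduced to a placeholder, and no request is actually sent to Twitter. The sketch only shows how target users might be split across worker processes, each holding its own session.

```python
from multiprocessing import Pool


def partition(users, n_workers):
    """Split the user list into n_workers round-robin chunks of near-equal size."""
    return [users[i::n_workers] for i in range(n_workers)]


def open_session(worker_id):
    # Hypothetical stand-in for per-process authentication; the paper obtains
    # it by simulating a user's access to Twitter, which is not reproduced here.
    return {"worker": worker_id, "token": f"session-{worker_id}"}


def fetch_tweets(session, user):
    # Placeholder for the real scraping step: returns one fake tweet per user.
    return [f"{user}: tweet fetched via {session['token']}"]


def crawl_chunk(args):
    """Run inside one worker process: authenticate once, then crawl each user."""
    worker_id, chunk = args
    session = open_session(worker_id)
    tweets = []
    for user in chunk:
        tweets.extend(fetch_tweets(session, user))
    return tweets


def crawl_in_parallel(users, n_workers):
    """Distribute the chunks over a process pool and flatten the results."""
    chunks = partition(users, n_workers)
    with Pool(processes=n_workers) as pool:
        results = pool.map(crawl_chunk, list(enumerate(chunks)))
    return [tweet for chunk_result in results for tweet in chunk_result]
```

When run as a script, `crawl_in_parallel(users, 4)` should be called under an `if __name__ == "__main__":` guard so that worker processes can be started safely on every platform. Round-robin partitioning is an assumption made for simplicity; users with very different tweet counts would likely need a more balanced assignment.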
