Distributed OSN Crawling System based on Ajax Simulation

Shan Jixi, Sha Ying, Li Yang, Xu Kai, Guo Li

doi:10.1016/j.procs.2013.05.105

Abstract

In the age of Web 2.0, online social networks (OSNs) such as Facebook, Twitter, and WeiBo have become the most popular platforms for spreading information, and they attract growing attention from the Information Retrieval (IR) community. However, traditional web crawling systems struggle with OSNs because of their complicated web pages, the rapid growth in message volume, and their heavy use of Asynchronous JavaScript and XML (AJAX). We design and implement a distributed crawling system based on Message Oriented Middleware (MOM) and Ajax simulation, which crawled 70 million Twitter detail items in one month. The data acquisition results show that crawling with Ajax simulation can retrieve items loaded by Ajax without limitations, and that the distributed system based on MOM and Ajax simulation can crawl massive OSN data completely, quickly, frequently, and without restriction.

Full Text