A High Efficient Incremental Microblog Crawler: Design and Implementation

Dayong Shen

doi:10.12733/jics20101638

Abstract

With the rapid development of microblog technology, many research problems concerning microblogs have attracted growing attention. Fetching data from microblogs is the groundwork for this research. In this paper we propose a flexible multithreaded microblog crawling architecture based on the classic multi-producer, multi-consumer model, and further implement a highly efficient incremental microblog crawler for Sina Microblog (also called Weibo). The crawler handles vertical crawling, dynamic webpages, and automatic login, which general-purpose crawlers cannot handle. It also achieves high-precision extraction of structured web data. Several metrics are designed to evaluate crawling performance. Experimental results demonstrate that the crawler achieves over 95% coverage and good freshness.
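As an illustration of the multi-producer, multi-consumer model mentioned above, the following is a minimal Python sketch and not the implementation described in the paper. The function `crawl`, the callback `fetch`, the `seen` set, and the reading of "incremental" as "store only posts not seen in earlier runs" are all assumptions made for this example. No login, dynamic-page handling, or real network access is shown.

```python
import queue
import threading


def crawl(seed_ids, fetch, seen=None, n_producers=2, n_consumers=2):
    """Return posts not already in `seen` (hypothetical incremental rule).

    fetch(user_id) -> list of (post_id, text) is an assumed stand-in
    for downloading and parsing one user's page.
    """
    seen = set() if seen is None else seen
    tasks = queue.Queue()   # user ids waiting to be fetched
    pages = queue.Queue()   # fetched pages waiting to be processed
    lock = threading.Lock() # guards `seen` and `new_posts`
    new_posts = []

    def producer():
        # Producers download pages and hand them to the consumers.
        while True:
            uid = tasks.get()
            if uid is None:  # sentinel: no more work
                break
            pages.put(fetch(uid))

    def consumer():
        # Consumers keep only posts that have not been stored before.
        while True:
            posts = pages.get()
            if posts is None:  # sentinel: no more pages
                break
            for post_id, text in posts:
                with lock:
                    if post_id not in seen:
                        seen.add(post_id)
                        new_posts.append((post_id, text))

    producers = [threading.Thread(target=producer) for _ in range(n_producers)]
    consumers = [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()

    for uid in seed_ids:
        tasks.put(uid)
    for _ in producers:      # one sentinel per producer, after all seeds
        tasks.put(None)
    for t in producers:
        t.join()
    for _ in consumers:      # all pages are queued; now stop the consumers
        pages.put(None)
    for t in consumers:
        t.join()
    return new_posts
```

Passing the same `seen` set to a later call makes that call return only posts that appeared since the previous run, which is one simple way to picture incremental crawling under the assumptions stated above.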
