Abstract

With the rapid development of microblog technology, many research problems concerning microblogs have attracted growing attention. Fetching data from microblogs is the groundwork for this research. In this paper we propose a flexible multithreaded microblog crawling architecture based on the classic multi-producer, multi-consumer model, and implement a highly efficient incremental crawler for Sina Microblog (also called Weibo). The crawler handles vertical crawling, dynamic webpages, and automatic login, which general-purpose crawlers cannot handle. It also achieves high-precision extraction of structured web data. We design several metrics to evaluate crawling performance. Experimental results demonstrate that the crawler achieves over 95% coverage and good freshness.
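
The multi-producer, multi-consumer model with incremental crawling can be illustrated with a minimal sketch. The abstract does not specify the actual design, so everything here is an assumption made for illustration: producers are taken to be page-fetching threads, consumers are taken to be extraction threads, and the names `fake_fetch`, `crawl`, and `seen_ids` are hypothetical. The fetch function is a stand-in that makes no network requests and does not touch any real Weibo interface.

```python
import queue
import threading


def fake_fetch(user_id):
    """Hypothetical stand-in for a page fetch: returns (post_id, text) pairs."""
    return [(f"{user_id}-{n}", f"post {n} by {user_id}") for n in range(3)]


def crawl(user_ids, seen_ids, n_producers=2, n_consumers=2, fetch=fake_fetch):
    """Fetch pages for user_ids and return only posts not already in seen_ids."""
    tasks = queue.Queue()            # user ids waiting to be fetched
    pages = queue.Queue(maxsize=16)  # fetched pages waiting to be parsed
    results = []
    lock = threading.Lock()          # guards seen_ids and results

    for uid in user_ids:
        tasks.put(uid)

    def producer():
        # Assumed role: pull a task, fetch the page, hand it to the consumers.
        while True:
            try:
                uid = tasks.get_nowait()
            except queue.Empty:
                return
            pages.put(fetch(uid))

    def consumer():
        # Assumed role: extract posts and keep only the unseen ones
        # (this is the incremental part of the sketch).
        while True:
            page = pages.get()
            if page is None:  # sentinel: no more pages will arrive
                return
            for post_id, text in page:
                with lock:
                    if post_id not in seen_ids:
                        seen_ids.add(post_id)
                        results.append((post_id, text))

    producers = [threading.Thread(target=producer) for _ in range(n_producers)]
    consumers = [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    for _ in consumers:
        pages.put(None)  # one sentinel per consumer
    for t in consumers:
        t.join()
    return results


if __name__ == "__main__":
    seen = set()
    first = crawl(["alice", "bob"], seen)
    second = crawl(["alice", "bob", "carol"], seen)
    print(len(first), len(second))
```

In this sketch the bounded `pages` queue keeps fetching from running far ahead of extraction, and the shared `seen_ids` set means a second pass over the same accounts returns only posts that were not collected before.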
