Summary

With the rapid development of the network, stand-alone crawlers find it increasingly difficult to locate and gather information, and distributed crawlers are gradually being adopted to solve this problem. This paper proposes a task scheduling strategy based on weighted round robin for a small-scale distributed crawler, in which the weight of each node is computed from its crawling efficiency; implements a distributed crawler system with multithreading support and deduplication built around this algorithm; and discusses some possible extensions and details. The design of the error recovery mechanism and the node table gives crawling nodes flexible scalability and fault tolerance. Finally, we conducted experiments to demonstrate the good load balancing performance of the system. Copyright © 2015 John Wiley & Sons, Ltd.
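To illustrate the kind of scheduling the abstract describes, the sketch below shows a weighted round robin dispatcher whose per-node weights are derived from measured crawling efficiency (pages fetched per second). This is a minimal illustration under assumed definitions, not the paper's implementation: the node names, the efficiency formula, and the smooth weighted round robin variant used here are all assumptions for demonstration.

```python
# Minimal sketch (assumed, not the paper's code): weighted round robin task
# dispatch where each node's weight reflects its recent crawling efficiency.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    pages_crawled: int = 0   # pages fetched in the last measurement window
    elapsed_s: float = 1.0   # duration of that window in seconds
    current: float = 0.0     # running counter for smooth weighted round robin

    @property
    def weight(self) -> float:
        # Hypothetical efficiency-based weight: pages per second.
        return self.pages_crawled / max(self.elapsed_s, 1e-9)

def pick_node(nodes: list[Node]) -> Node:
    """Smooth weighted round robin: successive calls spread tasks across
    nodes in proportion to their weights."""
    total = sum(n.weight for n in nodes)
    for n in nodes:
        n.current += n.weight
    best = max(nodes, key=lambda n: n.current)
    best.current -= total
    return best

if __name__ == "__main__":
    cluster = [Node("node-a", 300, 10), Node("node-b", 150, 10), Node("node-c", 50, 10)]
    # Dispatch 10 crawl tasks; node-a should receive roughly 60% of them.
    for task_id in range(10):
        print(task_id, pick_node(cluster).name)
```

In such a scheme, nodes that crawl faster accumulate weight and therefore receive proportionally more URLs, which is one straightforward way to obtain the load balancing behavior the abstract reports.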