Distributed crawling is one of the mainstream technologies for collecting text data and is essential for mining the vast amount of data available on the Internet. Information on the Internet is clustered by keyword correlation, and users query search engines with relevant keywords to find the information they care about. Distributed crawlers mine Internet information by simulating this user behavior: the more important a keyword is to users, the higher the correlation of the data to that keyword. To preferentially collect the information that users care about with minimal resource consumption, in this paper we design a scheduling framework and propose a novel hunger-based scheduling strategy for distributed crawlers. We first define the load capacity of a distributed crawler as its hunger, which reflects its ability to complete tasks, and divide the keyword queues into sub-queues according to the hunger of the crawlers. Then, we use a vector space model and the cosine similarity algorithm to learn the correlation between keywords and text data, and apply the optimized logistic algorithm to measure the importance of keywords. Meanwhile, we design a comprehensive evaluation algorithm to quantify the contribution of keywords and update the order of the sub-queues accordingly. Finally, the reordered sub-queues are used in deeper scheduling to preferentially obtain the data that users desire while consuming the fewest resources. Experimental results demonstrate that our method optimizes the scheduling procedure and makes crawling more efficient, with less run time.
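As an illustration of the keyword-to-text correlation step, the following is a minimal sketch of cosine similarity over a vector space model. It is not the authors' implementation: the whitespace tokenizer, the raw term-frequency weighting, and the function names `term_vector`, `cosine_similarity`, and `rank_keywords` are all assumptions made for this example, and the paper may use a different weighting scheme.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a bag-of-words term-frequency vector.

    Assumption: lowercase + whitespace tokenization; the paper's
    actual preprocessing is not specified in the abstract.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as uncorrelated
    return dot / (norm_a * norm_b)


def rank_keywords(keywords, document):
    """Hypothetical helper: order keywords by correlation to a document."""
    doc_vec = term_vector(document)
    scored = [(kw, cosine_similarity(term_vector(kw), doc_vec)) for kw in keywords]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_keywords(["distributed crawler", "weather forecast"], "a distributed crawler collects text data")` would place "distributed crawler" first, since it shares terms with the document while the other keyword does not. A scheduler could use such scores as one input when reordering sub-queues.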