A hunger-based scheduling strategy for distributed crawler

Xi Wang, Zhichao Chen, Mingming Kong, Bo Li

doi:10.1016/j.eswa.2023.119798

Abstract

Distributed crawling is a mainstream technology for collecting text data and is essential for mining the vast amount of data available on the Internet. Internet information is clustered by keyword correlation, and users retrieve the information they care about by searching for relevant keywords in search engines. Distributed crawlers mine Internet information by simulating this user behavior: the more important a keyword is to users, the higher the correlation between the collected data and that keyword. To preferentially collect the information users care about with minimal resource consumption, we design a scheduling framework and propose a novel hunger-based scheduling strategy for distributed crawlers. We first define the load capacity of a distributed crawler as its hunger, which reflects its ability to complete tasks, and divide the keyword queue into sub-queues based on the hunger of the crawlers. Then, we use a vector space model and the cosine similarity algorithm to learn the correlation between keywords and text data, and apply an optimized logistic algorithm to measure the importance of keywords. Meanwhile, we design a comprehensive evaluation algorithm to quantify the contribution of keywords and update the order of the sub-queues accordingly. Finally, the new sub-queues are used in deeper scheduling to preferentially retrieve the data users want while consuming the fewest resources. Experimental results demonstrate that our method optimizes the scheduling procedure and makes crawling more efficient, with shorter run time.
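
Two of the steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the term weighting, the hunger formula, or the partitioning rule, so the sketch assumes raw term counts for the vector space model and sub-queue sizes proportional to each crawler's hunger. The function names (cosine_similarity, keyword_relevance, split_by_hunger) are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)


def keyword_relevance(keyword, text):
    """Score a keyword phrase against a text.

    Assumption: whitespace tokenization and raw term counts; the paper's
    actual weighting scheme is not stated in the abstract.
    """
    keyword_vec = Counter(keyword.lower().split())
    text_vec = Counter(text.lower().split())
    return cosine_similarity(keyword_vec, text_vec)


def split_by_hunger(keywords, hunger):
    """Split an ordered keyword queue into contiguous sub-queues.

    Assumption: each crawler receives a share of the queue proportional
    to its hunger value; the paper may define the split differently.
    """
    total = sum(hunger.values())
    if total <= 0:
        raise ValueError("total hunger must be positive")
    sub_queues = {}
    start = 0
    cumulative = 0.0
    for crawler, value in hunger.items():
        cumulative += value
        end = round(len(keywords) * cumulative / total)
        sub_queues[crawler] = keywords[start:end]
        start = end
    return sub_queues


if __name__ == "__main__":
    queue = ["web crawler", "task scheduling", "load balancing", "search engine"]
    print(split_by_hunger(queue, {"crawler_a": 3, "crawler_b": 1}))
    print(keyword_relevance("web crawler", "a distributed web crawler schedules tasks"))
```

Contiguous slices are used so that the existing order of the keyword queue is preserved inside each sub-queue; reordering by keyword contribution, as the abstract describes, would be a separate step.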

Journal: Expert Systems With Applications
Publication Date: Mar 6, 2023
Citations: 1
