Abstract

Distributed crawling is a mainstream technology for collecting text data and is essential for mining the vast amount of data available on the Internet. Internet information clusters around correlated keywords, and users query search engines with relevant keywords to find the information they care about. Distributed crawlers mine Internet information by simulating this user behavior: the more important a keyword is to users, the higher the correlation between the collected data and that keyword. To preferentially collect the information users care about with minimal resource consumption, we design a scheduling framework and propose a novel hunger-based scheduling strategy for distributed crawlers. We first define the load capacity of a distributed crawler as its hunger, which reflects its ability to complete tasks, and divide the keyword queue into sub-queues according to the hunger of the crawlers. We then use a vector space model and cosine similarity to learn the correlation between keywords and text data, and apply an optimized logistic algorithm to measure the importance of keywords. We also design a comprehensive evaluation algorithm that quantifies the contribution of each keyword and use it to update the order of the sub-queues. Finally, the reordered sub-queues drive deeper scheduling, so that the data users want is collected first at the lowest resource cost. Experimental results demonstrate that our method optimizes the scheduling procedure and makes crawling more efficient, with shorter run time.
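Two of the steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the term weighting used in the vector space model or the rule for sizing sub-queues, so both are assumptions here: the vectors hold raw term frequencies, and each crawler receives a contiguous share of the keyword queue proportional to its hunger. The function names `cosine_similarity` and `split_by_hunger` and the crawler names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine_similarity(keyword_text: str, document_text: str) -> float:
    """Cosine similarity of two texts in a bag-of-words vector space.

    Assumption: raw term frequencies are used as vector components.
    """
    a = Counter(keyword_text.lower().split())
    b = Counter(document_text.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as uncorrelated
    return dot / (norm_a * norm_b)


def split_by_hunger(keywords: List[str],
                    hunger: Dict[str, float]) -> Dict[str, List[str]]:
    """Split an ordered keyword queue into one sub-queue per crawler.

    Assumption: sub-queue length is proportional to the crawler's hunger
    (its remaining load capacity); the paper may use a different rule.
    """
    total = sum(hunger.values())
    if total <= 0:
        raise ValueError("total hunger must be positive")
    names = list(hunger)
    sub_queues: Dict[str, List[str]] = {}
    start = 0
    cumulative = 0.0
    for i, name in enumerate(names):
        cumulative += hunger[name]
        # The last crawler takes whatever remains, so no keyword is dropped.
        if i == len(names) - 1:
            end = len(keywords)
        else:
            end = round(len(keywords) * cumulative / total)
        sub_queues[name] = keywords[start:end]
        start = end
    return sub_queues


if __name__ == "__main__":
    queue = ["crawler", "scheduling", "hunger", "keyword", "queue", "cosine"]
    print(split_by_hunger(queue, {"crawler_a": 2.0, "crawler_b": 1.0}))
    print(cosine_similarity("distributed crawler",
                            "distributed crawler scheduling distributed"))
```

Under these assumptions, a sub-queue could then be reordered by sorting its keywords on a score that combines this similarity with a keyword-importance value. How the two are combined is left to the evaluation algorithm described in the paper and is not reproduced here.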
