Abstract

Web crawlers are an important tool for obtaining information from the Internet in a timely manner. In a typical web crawler system with limited bandwidth, many websites are crawled under different time constraints. Existing studies of web crawler systems do not consider bandwidth allocation in such a complex environment; hence, the time constraints may not be satisfied. In this study, we investigate bandwidth allocation approaches for such a web crawler system. The approaches are designed for two scenarios: the number of websites either exceeds or does not exceed the maximum number of web crawlers that the system can execute simultaneously. When the number of websites does not exceed this maximum, we propose approaches that control the bandwidth of each web crawler to minimize either the maximum completion time or the sum of the execution times of all web crawlers, under both sufficient and insufficient bandwidth. When the number of websites exceeds this maximum, we propose a round-based reallocation approach that schedules both the execution order and the bandwidth allocation of the web crawlers. Extensive simulations are conducted to validate the proposed approaches, and the results show that they satisfy the time constraints well and achieve desirable execution performance in various scenarios.
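
As a rough illustration of the "minimize the maximum completion time" objective, the sketch below assumes a simplified model that is not taken from the paper: each crawler has a known data volume to download, all crawlers start together, the total bandwidth is fully divisible, and there are no per-crawler rate caps or deadlines. Under those assumptions, splitting bandwidth in proportion to data volume makes every crawler finish at the same time, which is the smallest achievable maximum. The function names and the example numbers are hypothetical.

```python
from typing import List


def proportional_allocation(sizes: List[float], total_bandwidth: float) -> List[float]:
    """Split total_bandwidth across crawlers in proportion to their data volumes.

    Assumed model (not from the paper): sizes[i] is the volume crawler i must
    download, and bandwidth can be divided arbitrarily among crawlers.
    """
    total_size = sum(sizes)
    if total_size <= 0 or total_bandwidth <= 0:
        raise ValueError("sizes must sum to a positive value and bandwidth must be positive")
    return [total_bandwidth * s / total_size for s in sizes]


def completion_times(sizes: List[float], rates: List[float]) -> List[float]:
    """Time each crawler needs at its allocated rate (0 if it has nothing to fetch)."""
    return [s / r if s > 0 else 0.0 for s, r in zip(sizes, rates)]


if __name__ == "__main__":
    # Hypothetical workload: volumes in MB, bandwidth in MB/s.
    sizes = [100.0, 300.0, 600.0]
    rates = proportional_allocation(sizes, total_bandwidth=10.0)
    times = completion_times(sizes, rates)
    print("rates:", rates)
    print("completion times:", times)
    print("maximum completion time:", max(times))
```

An equal split of the same bandwidth would leave the largest crawler running long after the others finish, so the maximum completion time would be higher. Per-website time constraints and the limit on simultaneously running crawlers, which the study addresses, are outside this simplified model.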
