Abstract
Web crawlers are an important tool for obtaining information from the Internet in a timely manner. In a typical web crawler system with limited bandwidth, many websites are crawled under different time constraints. Existing studies of web crawler systems do not consider bandwidth allocation in such a complex environment; hence, the time constraints may not be satisfied. In this study, we investigate bandwidth allocation approaches for such a web crawler system. The approaches are designed for two scenarios: the number of websites either exceeds or does not exceed the maximum number of web crawlers that the system can execute simultaneously. For the latter scenario, we propose approaches that control the bandwidth of each web crawler to minimize either the maximum completion time or the sum of the execution times of all web crawlers, under both sufficient and insufficient bandwidth. For the former scenario, we propose a round-based reallocation approach that schedules both the execution order and the bandwidth allocation of the web crawlers. Extensive simulations are conducted to validate the proposed approaches, and the results show that our approaches satisfy the time constraints well and achieve desirable execution performance in various scenarios.
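To make the objective of minimizing the maximum completion time concrete, the following is a minimal illustrative sketch and not the allocation method proposed in the paper. It assumes a simplified model in which each crawler has a known amount of data to download, all crawlers run concurrently, and the only limit is a shared total bandwidth. Under those assumptions, giving each crawler bandwidth in proportion to its data size makes all crawlers finish at the same time, which is the smallest possible maximum completion time. The function names, parameters, and units are hypothetical.

```python
from typing import List


def proportional_allocation(sizes_mbit: List[float],
                            total_bandwidth_mbps: float) -> List[float]:
    """Split a shared bandwidth in proportion to each crawler's data size.

    Assumption (not from the paper): every crawler runs concurrently and
    the shared link is the only bottleneck.
    """
    if total_bandwidth_mbps <= 0:
        raise ValueError("total bandwidth must be positive")
    if any(s < 0 for s in sizes_mbit):
        raise ValueError("data sizes must be non-negative")
    total_size = sum(sizes_mbit)
    if total_size == 0:
        # Nothing to download, so no bandwidth is needed.
        return [0.0 for _ in sizes_mbit]
    return [total_bandwidth_mbps * s / total_size for s in sizes_mbit]


def completion_times(sizes_mbit: List[float],
                     allocation_mbps: List[float]) -> List[float]:
    """Time in seconds for each crawler to finish at its allocated rate."""
    return [s / b if s > 0 else 0.0
            for s, b in zip(sizes_mbit, allocation_mbps)]


if __name__ == "__main__":
    sizes = [100.0, 300.0, 600.0]   # hypothetical workloads, in megabits
    alloc = proportional_allocation(sizes, total_bandwidth_mbps=100.0)
    print(alloc)
    print(completion_times(sizes, alloc))
```

In this simplified model the common finish time equals the total data size divided by the total bandwidth. Per-website time constraints, limits on the number of simultaneous crawlers, and round-based rescheduling, which the study addresses, are outside the scope of this sketch.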
More From: Journal of Ambient Intelligence and Humanized Computing