Abstract
Inspired by the concept of Internet computing, a DHT-based distributed Web crawling model has been proposed to overcome the bottlenecks of traditional Web crawling systems. Building on this model, we propose optimizations that reduce the download time of Web crawling tasks and thereby increase the efficiency of the system. The reduction in download time is achieved by shortening the network distance between crawler and crawlee. By exploiting the mapping mechanism of the Content Addressable Network (CAN) over a Network Coordinate System (NC), the problem can be recast as minimizing the distances between peers and resources on the DHT overlay. This paper focuses on reducing these distances, seeking to provide an improved location-aware infrastructure for distributed Web crawling. We first propose a new DHT-based distributed Web crawling model. Then, under this model, we present a new method based on CAN's splitting schemes that yields a significant decrease in crawler-crawlee distance compared with existing schemes. In addition, the load-balancing issue is addressed by combining the new method with existing ones.
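To illustrate the core idea of the mapping mechanism, the following is a minimal sketch, not the paper's actual scheme: it assumes a 2-D CAN keyspace, normalizes hypothetical NC positions into that keyspace, and assigns each crawling target to the peer whose mapped point is nearest, so that a URL tends to be crawled by a network-nearby peer. All names (`nc_to_can_point`, `assign_url`), the dimensionality, and the normalization span are illustrative assumptions.

```python
import math

# Hypothetical sketch: map hosts onto a 2-D CAN keyspace [0, 1)^2 using their
# network-coordinate (NC) positions, then assign each URL to the peer whose
# mapped CAN point is closest. The 2-D space and normalization span are
# assumptions for illustration, not the paper's actual splitting scheme.

def nc_to_can_point(nc_coord, span=100.0):
    """Normalize an NC position (x, y) into the unit CAN keyspace."""
    return tuple(min(max(c / span, 0.0), 1.0 - 1e-9) for c in nc_coord)

def assign_url(url_nc, peers):
    """Pick the peer whose CAN point is nearest to the URL host's mapped point.

    peers: dict mapping peer_id -> NC coordinate of that peer.
    """
    target = nc_to_can_point(url_nc)
    return min(peers,
               key=lambda pid: math.dist(nc_to_can_point(peers[pid]), target))

# Example: a host whose NC position is close to peer-A is crawled by peer-A.
peers = {"peer-A": (10.0, 20.0), "peer-B": (80.0, 75.0)}
print(assign_url((12.0, 18.0), peers))  # -> peer-A
```

Under this mapping, minimizing crawler-crawlee network distance reduces to minimizing Euclidean distance between points on the CAN overlay, which is the distance the paper's splitting schemes then try to shrink.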