Abstract
World wide web is a decentralized system that consists of a repository of information on the basis of web pages. These web pages act as a source of information or data in the present analytics world. Web crawlers are used for extracting useful information from web pages for different purposes. Firstly, it is used in web search engines where the web pages are indexed to form a corpus of information and allows the users to query on the web pages. Secondly, it is used for web archiving where the web pages are stored for later analysis phases. Thirdly, it can be used for web mining where the web pages are monitored for copyright purposes. The amount of information processed by the web crawler needs to be improved by using the capabilities of modern parallel processing technologies. In order to solve the problem of parallelism and the throughput of crawling this work proposes to optimize the Crawler4j using the Hadoop MapReduce programming model by parallelizing the processing of large input data. Crawler4j is a web crawler that retrieves useful information about the pages that it visits. The crawler Crawler4j coupled with data and computational parallelism of Hadoop MapReduce programming model improves the throughput and accuracy of web crawling. The experimental results demonstrate that the proposed solution achieves significant improvements with respect to performance and throughput. Hence the proposed approach intends to carve out a new methodology towards optimizing web crawling by achieving significant performance gain.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: Journal of The Institution of Engineers (India): Series B
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.