Optimizing Crawler4j using MapReduce Programming Model

G M Siddesh, K G Srinivasa, B R Rakshitha, Kavya Suresh, Madhushree Nijagal, K Y Madhuri

doi:10.1007/s40031-016-0267-z

Abstract

The World Wide Web is a decentralized system whose web pages together form a repository of information. These pages serve as a data source for present-day analytics. Web crawlers extract useful information from web pages for several purposes. First, they are used in web search engines, where pages are indexed to form a corpus of information that users can query. Second, they are used for web archiving, where pages are stored for later analysis. Third, they can be used for web mining, where pages are monitored for copyright purposes. The amount of information a web crawler can process needs to be increased by exploiting modern parallel processing technologies. To address the parallelism and throughput of crawling, this work proposes optimizing Crawler4j with the Hadoop MapReduce programming model by parallelizing the processing of large input data. Crawler4j is a web crawler that retrieves useful information about the pages it visits. Coupling Crawler4j with the data and computational parallelism of the Hadoop MapReduce programming model improves the throughput and accuracy of web crawling. The experimental results demonstrate that the proposed solution achieves significant improvements in performance and throughput. The proposed approach thus aims to establish a new methodology for optimizing web crawling with a significant performance gain.
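To make the MapReduce idea in the abstract concrete, the following is a minimal in-memory sketch of the map, shuffle, and reduce phases applied to a list of seed URLs. It is an illustration under stated assumptions, not the authors' implementation: it uses neither Hadoop nor the Crawler4j API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `plan_crawl`) and the choice of the host name as the grouping key are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_phase(url):
    """Map step: emit one (host, url) pair for an input URL."""
    host = urlsplit(url).netloc.lower()
    return host, url


def shuffle(pairs):
    """Group mapped pairs by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(host, urls):
    """Reduce step: drop duplicate URLs for one host and sort the remainder."""
    return host, sorted(set(urls))


def plan_crawl(seed_urls):
    """Run map, shuffle, and reduce in sequence (assumed pipeline, in memory only)."""
    mapped = [map_phase(url) for url in seed_urls]
    grouped = shuffle(mapped)
    return dict(reduce_phase(host, urls) for host, urls in grouped.items())


if __name__ == "__main__":
    seeds = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.com/a",
        "http://example.org/",
    ]
    for host, urls in plan_crawl(seeds).items():
        print(host, urls)
```

In a real cluster each map call and each per-host reduce call could run on a different node, which is where the parallelism comes from; here they simply run one after another in a single process.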

More From: Journal of The Institution of Engineers (India): Series B
