Abstract

Data mining has established itself as one of the most important disciplines in computer science, with growing industrial impact in recent years. Locating precise and relevant resources in the fast-growing Web within a reasonable amount of time is a major challenge for all-purpose, single-process crawlers, which creates demand for enhanced and more capable algorithms. Moreover, large-scale search engines cannot update their indexes frequently enough to present such information in a timely manner. In this study, a scalable focused crawling scheme is proposed with an incremental parallel Web crawler, in which Web pages relevant to multiple pre-defined topics are crawled concurrently. Furthermore, to solve the URL distribution problem, a compound decision model based on a multi-objective decision making method is introduced, which synthesizes multiple factors such as load balance and relevance; the update-frequency problem is handled by the local repository decision. The results show that the proposed system efficiently produces high quality, relevance, and freshness with a significantly low memory requirement.
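To make the compound decision model concrete, the following is a minimal sketch of one plausible form it could take: each incoming URL is assigned to the crawler process that maximizes a weighted sum of topical relevance and load balance. The weights, scoring functions, and all names below are illustrative assumptions, not the paper's exact formulation.

# Hypothetical sketch of multi-objective URL distribution:
# assign each URL to the crawler process maximizing a weighted
# combination of topical relevance and load balance.
from dataclasses import dataclass, field

@dataclass
class CrawlerProcess:
    topic_keywords: set[str]                       # topics this process is focused on
    queue: list[str] = field(default_factory=list) # pending URLs

def relevance(url: str, proc: CrawlerProcess) -> float:
    # Crude relevance proxy: fraction of the process's topic
    # keywords that appear in the URL string.
    if not proc.topic_keywords:
        return 0.0
    hits = sum(1 for kw in proc.topic_keywords if kw in url.lower())
    return hits / len(proc.topic_keywords)

def load_balance(proc: CrawlerProcess, max_queue: int = 1000) -> float:
    # Higher score for a shorter pending queue.
    return 1.0 - min(len(proc.queue), max_queue) / max_queue

def assign_url(url: str, procs: list[CrawlerProcess],
               w_rel: float = 0.7, w_load: float = 0.3) -> CrawlerProcess:
    # Pick the process maximizing the weighted-sum objective.
    best = max(procs, key=lambda p: w_rel * relevance(url, p)
                                    + w_load * load_balance(p))
    best.queue.append(url)
    return best

# Example: a football URL is routed to the sports-focused process
# unless that process's queue is already far more loaded.
procs = [CrawlerProcess({"sport", "football"}),
         CrawlerProcess({"finance", "stock"})]
assign_url("http://example.com/football/news", procs)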

Highlights

  • A Web crawler is a program that retrieves and stores pages from the Web

  • To download pages in a reasonable amount of time, a new hypertext resource discovery system called a focused crawler is used, which selectively seeks out pages relevant to a set of pre-defined topics

  • Other crawling methods incur a significant waste of time in maintaining the data



Introduction

A Web crawler is a program that retrieves and stores pages from the Web. To download pages in a reasonable amount of time, a new hypertext resource discovery system called a focused crawler is used, which selectively seeks out pages relevant to a set of pre-defined topics. Another crawler, the parallel crawler, runs multiple crawling processes in parallel (Balamurugan et al., 2012). Because Web documents are highly dynamic, the freshness of the local repository must be maintained in order to acquire useful information and integrate data, which requires pages to be crawled consistently. Full crawling wastes significant time and space (Zhu, 2007; Mannar Mannan et al., 2014). To overcome this, the incremental crawler was proposed: instead of crawling all Web pages, it selectively and incrementally updates the local repository. It follows that the crawler should satisfy certain objectives.
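To illustrate the incremental-update idea, the sketch below re-fetches a page only when its estimated revisit interval has elapsed, and rewrites the local repository only when the content hash has actually changed. The adaptive halving/doubling revisit policy and all names are illustrative assumptions, not the paper's algorithm.

# Hypothetical sketch of an incremental crawler's repository:
# revisit a page only when it is due, and shorten (or lengthen)
# its revisit interval depending on whether it actually changed.
import hashlib
import time

class LocalRepository:
    def __init__(self):
        self.pages = {}  # url -> (content_hash, last_crawled, interval_s)

    def due(self, url: str, now: float) -> bool:
        # A page is due if it is unseen or its revisit interval elapsed.
        if url not in self.pages:
            return True
        _, last, interval = self.pages[url]
        return now - last >= interval

    def update(self, url: str, content: bytes, now: float) -> bool:
        # Store the page; shrink the revisit interval if it changed,
        # grow it if it stayed the same. Returns True on change.
        digest = hashlib.sha256(content).hexdigest()
        old = self.pages.get(url)
        interval = old[2] if old else 3600.0
        changed = old is None or old[0] != digest
        interval = max(interval / 2, 60.0) if changed \
            else min(interval * 2, 86400.0)
        self.pages[url] = (digest, now, interval)
        return changed

# Example: only fetch and store when the page is due.
repo = LocalRepository()
now = time.time()
if repo.due("http://example.com/", now):
    repo.update("http://example.com/", b"<html>...</html>", now)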

