Abstract

Data mining has established itself as one of the most important disciplines in computer science, with growing industrial impact in recent years. Locating precise and relevant resources in the fast-growing Web within a reasonable amount of time is a major challenge for all-purpose, single-process crawlers, which creates demand for enhanced and more capable algorithms. Moreover, large-scale search engines cannot update their indexes frequently enough to present such information in a timely manner. In this study, a scalable focused crawling scheme is proposed with an incremental parallel Web crawler, in which Web pages relevant to multiple pre-defined topics are crawled concurrently. Furthermore, to solve the URL distribution problem, a compound decision model based on a multi-objective decision making method is introduced, which synthesizes multiple factors such as load balance and relevance; the update-frequency problem is handled by the local repository decision. The results show that the proposed system efficiently produces high quality, relevance, and freshness with a significantly low memory requirement.
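To make the compound decision model concrete, the following is a minimal sketch of one plausible form it could take: each incoming URL is assigned to the crawler process that maximizes a weighted sum of topical relevance and load balance. The weights, scoring functions, and all names below are illustrative assumptions, not the paper's exact formulation.

# Hypothetical sketch of multi-objective URL distribution:
# assign each URL to the crawler process maximizing a weighted
# combination of topical relevance and load balance.
from dataclasses import dataclass, field

@dataclass
class CrawlerProcess:
    topic_keywords: set[str]                       # topics this process is focused on
    queue: list[str] = field(default_factory=list) # pending URLs

def relevance(url: str, proc: CrawlerProcess) -> float:
    # Crude relevance proxy: fraction of the process's topic
    # keywords that appear in the URL string.
    if not proc.topic_keywords:
        return 0.0
    hits = sum(1 for kw in proc.topic_keywords if kw in url.lower())
    return hits / len(proc.topic_keywords)

def load_balance(proc: CrawlerProcess, max_queue: int = 1000) -> float:
    # Higher score for a shorter pending queue.
    return 1.0 - min(len(proc.queue), max_queue) / max_queue

def assign_url(url: str, procs: list[CrawlerProcess],
               w_rel: float = 0.7, w_load: float = 0.3) -> CrawlerProcess:
    # Pick the process maximizing the weighted-sum objective.
    best = max(procs, key=lambda p: w_rel * relevance(url, p)
                                    + w_load * load_balance(p))
    best.queue.append(url)
    return best

# Example: a football URL is routed to the sports-focused process
# unless that process's queue is already far more loaded.
procs = [CrawlerProcess({"sport", "football"}),
         CrawlerProcess({"finance", "stock"})]
assign_url("http://example.com/football/news", procs)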

Highlights

  • A Web crawler is a program that retrieves and stores pages from the Web

  • To download pages in a reasonable amount of time, a new hypertext resource discovery system called a focused crawler is used, which selectively seeks out pages relevant to a set of pre-defined topics

  • Other crawling methods incur a significant waste of time in maintaining the data



Introduction

A Web crawler is a program that retrieves and stores pages from the Web. To download pages in a reasonable amount of time, a new hypertext resource discovery system called a focused crawler is used, which selectively seeks out pages relevant to a set of pre-defined topics. Another crawler, the parallel crawler, runs multiple crawling processes in parallel (Balamurugan et al., 2012). Because Web documents are highly dynamic, the freshness of the local repository must be maintained in order to acquire useful information and integrate data, which requires pages to be crawled consistently. Full crawling wastes significant time and space (Zhu, 2007; Mannar Mannan et al., 2014). To overcome this, the incremental crawler was proposed: instead of crawling all Web pages, it selectively and incrementally updates the local repository. It follows that the crawler should satisfy certain objectives.
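To illustrate the incremental-update idea, the sketch below re-fetches a page only when its estimated revisit interval has elapsed, and rewrites the local repository only when the content hash has actually changed. The adaptive halving/doubling revisit policy and all names are illustrative assumptions, not the paper's algorithm.

# Hypothetical sketch of an incremental crawler's repository:
# revisit a page only when it is due, and shorten (or lengthen)
# its revisit interval depending on whether it actually changed.
import hashlib
import time

class LocalRepository:
    def __init__(self):
        self.pages = {}  # url -> (content_hash, last_crawled, interval_s)

    def due(self, url: str, now: float) -> bool:
        # A page is due if it is unseen or its revisit interval elapsed.
        if url not in self.pages:
            return True
        _, last, interval = self.pages[url]
        return now - last >= interval

    def update(self, url: str, content: bytes, now: float) -> bool:
        # Store the page; shrink the revisit interval if it changed,
        # grow it if it stayed the same. Returns True on change.
        digest = hashlib.sha256(content).hexdigest()
        old = self.pages.get(url)
        interval = old[2] if old else 3600.0
        changed = old is None or old[0] != digest
        interval = max(interval / 2, 60.0) if changed \
            else min(interval * 2, 86400.0)
        self.pages[url] = (digest, now, interval)
        return changed

# Example: only fetch and store when the page is due.
repo = LocalRepository()
now = time.time()
if repo.due("http://example.com/", now):
    repo.update("http://example.com/", b"<html>...</html>", now)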

