Abstract

Search engines are used to retrieve relevant data from trillions of web pages stored across many different servers. Conventional search engines can only retrieve information from the shallow (surface) web. The deep web is a vast store of hidden information that is not indexed by automated search engines, and locating its entry points is a challenging task; once located, however, the deep web can be harvested to return accurate results for a user query very quickly. This paper proposes a bi-level web crawler that locates deep-web interfaces and also removes redundant content from its database. In the first level, to avoid visiting a huge number of pages, the crawler searches for core pages with the help of search engines, ranking candidate sites for a given query and prioritizing the most relevant ones. In the second level, the crawler achieves fast in-site searching through adaptive link ranking, excavating the most relevant links. The web contains many copies of equivalent content and equivalent web pages, so duplicate and near-duplicate content occurs very frequently. Redundant content in the deep web is therefore removed by parsing the content of each fetched page and comparing the parsed content against the content of previously stored pages, which saves both storage space and the bandwidth the crawler spends fetching pages.
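The bi-level ranking strategy and the parse-and-compare duplicate filter summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the scoring functions score_site and score_link, the caller-supplied fetch_page and extract_links callables, and the 0.9 similarity threshold are all hypothetical placeholders, and a production system would replace the pairwise SequenceMatcher comparison with shingling or SimHash fingerprints to scale beyond a few thousand pages.

```python
import hashlib
from difflib import SequenceMatcher

def score_site(site_url: str, query: str) -> float:
    """Hypothetical level-one ranking: relevance of a whole site to the query.
    A real crawler would use richer features (anchor text, past harvest rate)."""
    return float(query.lower() in site_url.lower())

def score_link(link_url: str, query: str) -> float:
    """Hypothetical level-two ranking: relevance of an in-site link."""
    return float(query.lower() in link_url.lower())

def fingerprint(text: str) -> str:
    """Exact-duplicate check: hash of whitespace-normalized page text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def near_duplicate(text: str, stored_texts: list, threshold: float = 0.9) -> bool:
    """Near-duplicate check: pairwise similarity against already stored pages."""
    return any(SequenceMatcher(None, text, other).ratio() >= threshold
               for other in stored_texts)

def crawl(candidate_sites, query, fetch_page, extract_links, max_pages=100):
    """Bi-level crawl: rank sites first, then rank links within each site,
    skipping any page whose content duplicates what was already stored."""
    seen_hashes, stored_texts, results = set(), [], []
    # Level one: visit the most promising sites first.
    for site in sorted(candidate_sites,
                       key=lambda s: score_site(s, query), reverse=True):
        frontier = [site]
        while frontier and len(results) < max_pages:
            # Level two: follow the most promising in-site link first.
            frontier.sort(key=lambda u: score_link(u, query), reverse=True)
            url = frontier.pop(0)
            text = fetch_page(url)  # caller-supplied fetcher
            digest = fingerprint(text)
            if digest in seen_hashes or near_duplicate(text, stored_texts):
                continue  # redundant content: save storage and bandwidth
            seen_hashes.add(digest)
            stored_texts.append(text)
            results.append(url)
            # Caller-supplied link extractor; it should keep links in-site.
            frontier.extend(extract_links(text, url))
    return results
```

Given a list of candidate sites and the two callables, crawl(sites, query, fetch_page, extract_links) returns the URLs of stored, non-redundant pages in priority order; the dedup check runs before storage, so duplicate pages consume neither index space nor further crawl budget.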
