Abstract

Search engines are used to retrieve relevant data from trillions of web pages stored across many different servers. Conventional search engines can only retrieve information from the shallow (surface) web. The deep web is a vast store of hidden information that is not indexed by automated search engines, and locating its entry points is a challenging task; once located, however, the deep web can be harvested to return accurate results for a user query very quickly. This paper proposes a bi-level web crawler that locates deep-web interfaces and also removes redundant content from its database. In the first level, to avoid visiting a huge number of pages, the crawler searches for core pages with the help of search engines, ranking candidate sites for a given query and prioritizing the most relevant ones. In the second level, the crawler achieves fast in-site searching through adaptive link ranking, excavating the most relevant links. The web contains many copies of equivalent content and equivalent web pages, so duplicate and near-duplicate content occurs very frequently. Redundant content in the deep web is therefore removed by parsing the content of each fetched page and comparing the parsed content against the content of previously stored pages, which saves both storage space and the bandwidth the crawler spends fetching pages.
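The bi-level ranking strategy and the parse-and-compare duplicate filter summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the scoring functions score_site and score_link, the caller-supplied fetch_page and extract_links callables, and the 0.9 similarity threshold are all hypothetical placeholders, and a production system would replace the pairwise SequenceMatcher comparison with shingling or SimHash fingerprints to scale beyond a few thousand pages.

```python
import hashlib
from difflib import SequenceMatcher

def score_site(site_url: str, query: str) -> float:
    """Hypothetical level-one ranking: relevance of a whole site to the query.
    A real crawler would use richer features (anchor text, past harvest rate)."""
    return float(query.lower() in site_url.lower())

def score_link(link_url: str, query: str) -> float:
    """Hypothetical level-two ranking: relevance of an in-site link."""
    return float(query.lower() in link_url.lower())

def fingerprint(text: str) -> str:
    """Exact-duplicate check: hash of whitespace-normalized page text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def near_duplicate(text: str, stored_texts: list, threshold: float = 0.9) -> bool:
    """Near-duplicate check: pairwise similarity against already stored pages."""
    return any(SequenceMatcher(None, text, other).ratio() >= threshold
               for other in stored_texts)

def crawl(candidate_sites, query, fetch_page, extract_links, max_pages=100):
    """Bi-level crawl: rank sites first, then rank links within each site,
    skipping any page whose content duplicates what was already stored."""
    seen_hashes, stored_texts, results = set(), [], []
    # Level one: visit the most promising sites first.
    for site in sorted(candidate_sites,
                       key=lambda s: score_site(s, query), reverse=True):
        frontier = [site]
        while frontier and len(results) < max_pages:
            # Level two: follow the most promising in-site link first.
            frontier.sort(key=lambda u: score_link(u, query), reverse=True)
            url = frontier.pop(0)
            text = fetch_page(url)  # caller-supplied fetcher
            digest = fingerprint(text)
            if digest in seen_hashes or near_duplicate(text, stored_texts):
                continue  # redundant content: save storage and bandwidth
            seen_hashes.add(digest)
            stored_texts.append(text)
            results.append(url)
            # Caller-supplied link extractor; it should keep links in-site.
            frontier.extend(extract_links(text, url))
    return results
```

Given a list of candidate sites and the two callables, crawl(sites, query, fetch_page, extract_links) returns the URLs of stored, non-redundant pages in priority order; the dedup check runs before storage, so duplicate pages consume neither index space nor further crawl budget.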
