Design of a Metacrawler for web document retrieval

K R Remesh Babu, A P Arya

doi:10.1109/isda.2012.6416585

Abstract

Web crawlers ‘browse’ the World Wide Web (WWW) on behalf of search engines, collecting web pages from collections of billions of documents. A Metacrawler is similar to a meta search engine, which combines the top web search results from popular search engines. The World Wide Web is growing rapidly, and this poses great challenges to general-purpose crawlers. This paper introduces an architectural framework for a Metacrawler. The crawler enables the user to retrieve information relevant to a topic from more than one traditional web search engine, and it fetches only the pages that are relevant to that topic. The PageRank algorithm is often used to rank web pages, but this ranking can cause topic drift. A modified PageRank algorithm is therefore used to rank the retrieved web pages in a way that reduces this problem. A clustering method is used to combine the search results so that the user can easily select web pages from the clustered results according to their requirements. Experimental results show the effectiveness of the Metacrawler.
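
The pipeline outlined in the abstract (merging results from several engines, keeping topic-relevant pages, and ranking them with a PageRank variant that resists topic drift) can be illustrated with a minimal Python sketch. The abstract does not specify the paper's modified PageRank, so everything here is an assumption made for illustration: the reciprocal-rank merge, the term-overlap relevance measure, and the topic-biased teleport step are stand-ins, and the function names `merge_results`, `topic_relevance`, and `topic_biased_pagerank` are hypothetical.

```python
from collections import defaultdict


def merge_results(result_lists):
    """Merge ranked URL lists from several engines, dropping duplicates.

    Assumption: each URL is scored by the sum of its reciprocal ranks,
    a common fusion heuristic, not necessarily the paper's method.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda u: (-scores[u], u))


def topic_relevance(text, topic_terms):
    """Fraction of topic terms that occur in the page text (toy measure)."""
    terms = [t.lower() for t in topic_terms]
    if not terms:
        return 0.0
    words = set(text.lower().split())
    return sum(t in words for t in terms) / len(terms)


def topic_biased_pagerank(links, relevance, damping=0.85, iterations=50):
    """PageRank whose random jump favours topic-relevant pages.

    links:     dict mapping a URL to the list of URLs it links to
    relevance: dict mapping a URL to a non-negative relevance score
    Assumption: biasing the teleport vector is one known way to limit
    topic drift; the paper's own modification may differ.
    """
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    if not nodes:
        return {}
    total = sum(relevance.get(n, 0.0) for n in nodes)
    if total > 0:
        teleport = {n: relevance.get(n, 0.0) / total for n in nodes}
    else:
        teleport = {n: 1.0 / len(nodes) for n in nodes}

    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            outs = links.get(n, [])
            if outs:
                share = damping * rank[n] / len(outs)
                for v in outs:
                    new_rank[v] += share
            else:
                # Dangling page: redistribute its rank via the teleport vector.
                for v in nodes:
                    new_rank[v] += damping * rank[n] * teleport[v]
        rank = new_rank
    return rank
```

In this sketch, pages with zero relevance receive no teleport mass, so they can only gain rank through links from on-topic pages. That is the intuition behind using a topic bias to counter drift. Clustering of the merged results is left out for brevity.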

Full Text