Abstract

A web search engine is designed to search for information on the World Wide Web (WWW). Crawlers are programs that traverse the Web and retrieve pages by following hyperlinks. Faced with the large number of spam websites, traditional crawlers cannot cope with this problem; focused crawlers instead use semantic web technologies to analyze the semantics of hyperlinks and web documents. The focused crawler of a special-purpose search engine aims to selectively seek out pages relevant to a pre-defined set of topics rather than explore all regions of the Web: it does not collect every page it encounters, but selects and retrieves only the relevant ones. Because the crawler is only a computer program, it cannot judge directly how relevant a web page is; the central problem is how to retrieve the maximal set of relevant, high-quality pages. In our proposed approach, the score of an unvisited URL is computed from its anchor-text relevancy, the similarity of its description (obtained from the Google search engine) to the topic keywords, the cohesive-text similarity with the topic keywords, and the relevancy scores of its parent pages. The relevancy score is calculated using the vector space model.
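In the vector space model mentioned above, a text and the topic keywords are each represented as term-frequency vectors and compared by cosine similarity. The sketch below illustrates that calculation; the function names and the simple whitespace tokenization are illustrative assumptions, not the paper's actual implementation.

```python
import math
from collections import Counter


def tf_vector(tokens):
    """Sparse term-frequency vector for a tokenized text."""
    return Counter(tokens)


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    common = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in common)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevancy_score(text, topic_keywords):
    """Vector-space-model relevancy of a snippet (e.g. anchor text or a
    search-engine description of an unvisited URL) to the topic keywords."""
    doc_vec = tf_vector(text.lower().split())
    topic_vec = tf_vector(k.lower() for k in topic_keywords)
    return cosine_similarity(doc_vec, topic_vec)
```

Per the abstract, such a score would be computed separately for the anchor text, the search-engine description, the cohesive text, and the parent pages, then combined into the unvisited URL's overall score; the combination weights are not specified here.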
