Abstract

The need for increasingly specific replies to web search queries has prompted researchers to work on focused web crawling techniques for web spiders. The paper introduces a variety of lexical and link-based approaches to focused web crawling and highlights the important aspects of each.

Highlights

  • A crawler periodically traverses the web and collects information about web documents [6], which the search engine adds to its database and indexes.

  • A crawler may use breadth-first or depth-first methods to search the web for new pages. A breadth-first crawl begins at a particular web page and explores all pages that it can reach by following only one hyperlink from the original page. Once it has exhausted all web pages at that level, it explores all web pages that can be reached by following one hyperlink from any page discovered at level one.

  • A depth-first search proceeds by following a chain of hyperlinks down as far as possible. In contrast to breadth-first search, the hyperlinks on a given page are not fully exhausted before the crawler goes to the next level.

  • A crawler identifies the location of a document by its URL.
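The difference between the two traversal orders can be seen on a small example. The sketch below is only an illustration: the link graph `LINKS` and the function names `breadth_first` and `depth_first` are hypothetical and do not come from the paper, and real pages are replaced by single letters.

```python
from collections import deque

# Hypothetical toy link graph (an assumption for illustration):
# each page maps to the pages its hyperlinks point to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}


def breadth_first(links, seed):
    """Visit every page one hyperlink away before going a level deeper."""
    order = []
    seen = {seed}
    frontier = deque([seed])  # FIFO queue of pages waiting to be visited
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order


def depth_first(links, seed, seen=None):
    """Follow a chain of hyperlinks as far as possible before backtracking."""
    if seen is None:
        seen = set()
    seen.add(seed)
    order = [seed]
    for target in links.get(seed, []):
        if target not in seen:
            order.extend(depth_first(links, target, seen))
    return order


print(breadth_first(LINKS, "A"))  # ['A', 'B', 'C', 'D', 'E', 'F']
print(depth_first(LINKS, "A"))    # ['A', 'B', 'D', 'E', 'C', 'F']
```

In the breadth-first order, both pages linked from A are visited before any page two hyperlinks away; in the depth-first order, the crawler descends through B to D and E before it returns to C.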


Summary

INTRODUCTION

A crawler periodically traverses the web and collects information about web documents [6], which the search engine adds to its database and indexes. A crawler may employ a breadth-first strategy, which begins at a particular web page and explores all pages that it can reach by following only one hyperlink from the original page. The first generation of crawlers, on which most web search engines are based, relies heavily on traditional graph algorithms, such as breadth-first or depth-first traversal, to index the web (see Fig. 2). A focused crawler explores the web using a best-first search according to a specific topic; i.e., it downloads only topic-relevant documents in its path (see Fig. 3) instead of downloading all links, as a general crawler does. Crawling techniques for retrieving relevant, high-quality web pages are classified as:

LEXICAL BASED APPROACH
LINK BASED APPROACH
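The best-first behaviour of a focused crawler described in the introduction can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any crawler covered by the paper: the in-memory `PAGES` table stands in for the web, `relevance` is a hypothetical lexical score (the fraction of topic terms that appear in the page text), the `threshold` value is arbitrary, and each link is ranked by the score of the page on which it was found.

```python
import heapq

# Hypothetical stand-in for the web (an assumption for illustration):
# page name -> (page text, outgoing links).
PAGES = {
    "seed": ("web crawler search engine index", ["p1", "p2"]),
    "p1": ("focused crawler topic relevance", ["p3"]),
    "p2": ("cooking recipes pasta", ["p4"]),
    "p3": ("crawler link analysis", []),
    "p4": ("crawler frontier queue", []),
}
TOPIC = {"crawler", "search", "topic"}


def relevance(text, topic):
    """Hypothetical lexical score: share of topic terms found in the text."""
    return len(set(text.split()) & topic) / len(topic)


def focused_crawl(pages, seed, topic, threshold=0.3):
    """Best-first crawl that keeps and expands only topic-relevant pages."""
    # heapq is a min-heap, so priorities are stored negated:
    # the most promising link is popped first.
    frontier = [(-1.0, seed)]
    seen = {seed}
    collected = []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, outlinks = pages[url]
        score = relevance(text, topic)
        if score < threshold:
            continue  # off-topic page: not kept, and its links are not followed
        collected.append(url)
        for target in outlinks:
            if target in pages and target not in seen:
                seen.add(target)
                # Assumption: a link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, target))
    return collected


print(focused_crawl(PAGES, "seed", TOPIC))  # ['seed', 'p1', 'p3']
```

Here the off-topic page `p2` is discarded, so the page `p4` behind it is never reached even though it mentions a topic term. A general crawler would have fetched all five pages; the focused crawler trades that coverage for fewer irrelevant downloads.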
