Abstract

The rapid growth of the World Wide Web has spurred the development of retrieval tools such as search engines. Topic-specific crawlers are best suited for users looking for results on a particular subject. In this paper, a novel design of a topic-specific Web crawler based on a multi-agent system is presented. The proposed architecture employs two types of agents: retrieval agents and a coordinator agent. The coordinator agent is responsible for disseminating URLs from the crawl frontier to the individual retrieval agents. The URL frontier is modeled as multi-level queues to implement tunneling and is populated with URLs by a rule-based engine. The coordinator agent dynamically assigns URLs to retrieval agents so as to avoid downloading non-productive and duplicate Web pages. The empirical results clearly show the advantage of using multi-level frontier queues in terms of harvest ratio, crawl time, and the downloading of highly relevant Web pages.
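The coordinator's role described above can be sketched in code. The following is a minimal illustration, not the paper's implementation: a hypothetical `Coordinator` class that filters duplicate URLs with a seen-set and hands the remainder to retrieval agents in round-robin order. All names here are illustrative assumptions.

```python
from collections import deque
from itertools import cycle

class Coordinator:
    """Hypothetical sketch of the coordinator agent: it deduplicates
    frontier URLs and assigns them to retrieval agents round-robin."""

    def __init__(self, agent_names):
        self.seen = set()                     # URLs already dispatched (duplicate filter)
        self.agents = cycle(agent_names)      # round-robin over retrieval agents
        self.assignments = {name: deque() for name in agent_names}

    def dispatch(self, urls):
        """Assign each previously unseen URL to the next retrieval agent."""
        for url in urls:
            if url in self.seen:              # skip duplicates: never download twice
                continue
            self.seen.add(url)
            self.assignments[next(self.agents)].append(url)

coord = Coordinator(["retriever-1", "retriever-2"])
coord.dispatch(["http://a.example/", "http://b.example/", "http://a.example/"])
# the repeated "http://a.example/" is filtered out; only two URLs are dispatched
```

A real system would replace the round-robin policy with the dynamic, load-aware assignment the paper describes, but the duplicate-avoidance idea is the same.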

Highlights

  • The World Wide Web (WWW), or Web, can be viewed as a huge distributed database spread across several million hosts on the Internet, where data are stored as Web pages on Web servers

  • As the size of the WWW is colossal, search engines have become a vital tool for finding information

  • JADE is a promising agent development framework that supports the deployment of multiple agents


Summary

Introduction

The World Wide Web (WWW), or Web, can be viewed as a huge distributed database spread across several million hosts on the Internet, where data are stored as Web pages on Web servers. Users can navigate through Web pages with the help of browsers, which is time consuming. Another way to locate information is through search engines. Since a single-agent approach may be inefficient and impractical for a large-scale IR environment, most systems employ multi-agent systems. This paradigm has become increasingly important, as it represents a new way of analyzing, designing, and implementing complex software systems. An architectural framework is presented for crawling topic-specific Web pages using a multi-agent based system. The crawling agents are endowed with intelligence that guides them in deciding the appropriate URL to download. This feature is enabled by an enhanced rule-based system employing a multi-level crawl frontier.
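The multi-level crawl frontier that enables tunneling can be sketched as follows. This is an illustrative simplification under assumed semantics, not the paper's rule engine: level 0 holds URLs the rules judge on-topic, off-topic URLs are demoted one level per irrelevant hop, and URLs past a fixed tunneling depth are dropped. The class, level count, and URLs are all hypothetical.

```python
from collections import deque

MAX_LEVEL = 2  # assumed tunneling depth: how many off-topic hops are tolerated

class MultiLevelFrontier:
    """Hypothetical multi-level frontier sketch: level 0 holds URLs judged
    on-topic by the rule engine; off-topic URLs sink one level per irrelevant
    hop (tunneling) and are abandoned beyond MAX_LEVEL."""

    def __init__(self):
        self.levels = [deque() for _ in range(MAX_LEVEL + 1)]

    def add(self, url, relevant, parent_level=0):
        """Enqueue a URL: relevant pages reset to level 0, others sink."""
        level = 0 if relevant else parent_level + 1
        if level <= MAX_LEVEL:                # past this, the tunnel is abandoned
            self.levels[level].append((url, level))

    def next_url(self):
        """Serve the most promising URL: lowest nonempty level first."""
        for queue in self.levels:
            if queue:
                return queue.popleft()
        return None

frontier = MultiLevelFrontier()
frontier.add("http://offtopic.example/", relevant=False)  # sinks to level 1
frontier.add("http://ontopic.example/", relevant=True)    # enters level 0
```

Dequeuing from the lowest nonempty level means relevant pages are always fetched first, while tunneling keeps a bounded path open through off-topic pages toward relevant regions of the Web.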

Related Works
Architecture
Findings
Discussion
Conclusions
