A new approach to Web Crawling — DHEKTS Crawler in comparison with various Crawlers

K Thirugnanasambanthan

doi:10.17485/ijst/v14i19.599

Abstract

Objectives: To propose a crawler that visits websites to collect information and build a search engine index; to compare the license, implementation language, and effectiveness of various crawlers with the proposed DHEKTS crawler; to compare their characteristics, tasks, and functions with those of the proposed DHEKTS crawler; and to identify the merits of the DHEKTS crawler. Methods: A new crawler called DHEKTS is developed to filter and synchronize documents such as images, links, and HTML code from a given website. The crawler is distinctive in that it returns all the details of a particular website, including its images, links, HTML code, and contents. It can crawl through the links on a specified website and follow them further to other links on the website. The DHEKTS crawler is designed for depth and relevance crawling and provides several crawling mechanisms that support a variety of information. The requirements are: operating system: Windows 7 or higher; front end: PHP; back end: MySQL; RAM: minimum 4 GB; server: a high-speed server with good storage capacity. Findings: The DHEKTS crawler retrieves web-related links, images, HTML code, and information up to the fifth level of crawling, and its relevance search returns relevant information. Between them, existing crawlers fulfill the major functions of crawling, but the DHEKTS crawler is built to execute all of these functions in a single crawler. Applications: The crawler is applied to crawl various websites and retrieve valuable data. Keywords: Crawler; DHEKTS Crawler; License; Tasks; Functions; Effectiveness; Comparison

Highlights

A web crawler systematically browses the World Wide Web (WWW) for the purpose of indexing
The license, implementation language, and effectiveness of various crawlers are compared with those of the proposed DHEKTS crawler
This paper is about a new approach in web crawling using DHEKTS Crawler which is quite different from prominent crawlers

Summary

Introduction

A web crawler systematically browses the World Wide Web (WWW) for the purpose of indexing. Web search engines use crawlers to update their web content and to index other sites. The features of one crawler are not necessarily found in another, and no existing crawler implements all of them. This study addresses that problem by building a unique crawler that systematically browses the WWW to index information and supports multiple crawling features, such as retrieving links, images, and HTML source, and depth crawling (10). A new crawler called DHEKTS is developed to filter and synchronize documents such as images, links, and HTML code from a given website. The crawler is distinctive in that it returns all the details of a particular website, including its images, links, and files. It works on the basis of the search keywords, the number of keywords present in a particular website, and a user relevance rating given to the website.
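
The behaviour described above (collecting links, images, and HTML source, crawling to a bounded depth, and counting keyword occurrences) can be illustrated with a minimal sketch. This is not the DHEKTS implementation, which the paper describes as built with PHP and MySQL; it is a Python illustration under stated assumptions. The names `PageParser`, `crawl`, `fetch`, and `SITE` are hypothetical, the pages come from an in-memory dictionary instead of the network, and relevance is approximated by a plain substring count of one keyword. The user relevance rating mentioned in the text is not modelled.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects link targets, image sources and text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Resolve relative URLs against the page they were found on.
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))

    def handle_data(self, data):
        self.text.append(data)


def crawl(start_url, fetch, keyword, max_depth=5):
    """Breadth-first crawl up to max_depth levels below start_url.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that returns the HTML of a URL as a string, or None if unavailable.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    results = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser(url)
        parser.feed(html)
        # Substring count, so "crawler" also matches inside "crawlers".
        hits = " ".join(parser.text).lower().count(keyword.lower())
        results.append({
            "url": url,
            "depth": depth,
            "links": parser.links,
            "images": parser.images,
            "html": html,
            "keyword_hits": hits,
        })
        if depth < max_depth:
            for link in parser.links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    # Pages with more keyword occurrences are listed first.
    results.sort(key=lambda r: r["keyword_hits"], reverse=True)
    return results


# Hypothetical three-page site used in place of real HTTP requests.
SITE = {
    "http://example.test/": '<a href="/a">A</a><img src="logo.png"><p>crawler home</p>',
    "http://example.test/a": '<a href="/b">B</a><p>crawler crawler</p>',
    "http://example.test/b": "<p>deep page</p>",
}
```

Calling `crawl("http://example.test/", SITE.get, "crawler")` visits all three pages and orders them by keyword count. Passing `max_depth=1` stops after the pages linked directly from the start page. A real crawler would replace `SITE.get` with an HTTP fetch and would also need politeness controls such as rate limiting and robots.txt handling, which are outside this sketch.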

Objectives

Architecture of DHEKTS Crawler

Results and Discussion

Scalable

Distributed