Abstract

The World Wide Web (WWW) is a vast collection of interlinked HTML (HyperText Markup Language) documents that has developed over time. As the web grows, obtaining meaningful information from it is becoming a tedious task. Search engines were developed to extract information from the web, and they act as an interface between the web and the user. The three important components of a search engine are the crawler, the indexer and page ranking. As the volume of data on the web increases rapidly, so does its ambiguity. For a naïve user, this makes it challenging to retrieve relevant and required information for a given search. The crawler is the component of a search engine responsible for traversing web pages and fetching relevant links from the web, so every search engine depends heavily on its crawler. A detailed study of the available crawlers, covering their working methodology and their drawbacks, is therefore necessary before proceeding to develop a smart crawler. As groundwork for a smart crawler, which is left as future scope, this paper presents a comparative analysis of widely used crawlers: the Focused Crawler, Inference-based Crawler, Incremental Crawler, Parallel Crawler and Distributed Crawler.
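
To make the crawler's role concrete, the following is a minimal sketch of a generic breadth-first crawler, not any of the specific crawlers compared in the paper. It assumes a caller-supplied `fetch` function that returns a page's HTML or `None`; the names `crawl`, `LinkExtractor`, `PAGES` and the `example.test` URLs are hypothetical and exist only for illustration. The pages are held in memory so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first traversal from seed; fetch(url) returns HTML or None."""
    frontier = deque([seed])   # URLs waiting to be visited
    seen = {seed}              # URLs already queued, to avoid revisits
    order = []                 # URLs successfully fetched, in visit order
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:       # page unavailable: skip it
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Hypothetical three-page site held in memory (assumption for illustration).
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a> <a href="/missing">x</a>',
}

visited = crawl("http://example.test/", PAGES.get)
```

A real crawler would replace `PAGES.get` with an HTTP fetch and add politeness rules, relevance filtering and revisit policies, which is where the crawler types listed above differ from one another.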

