Abstract
As the deep web grows at a very fast pace, there has been increased interest in techniques that help efficiently locate deep-web interfaces. However, due to the large volume of web resources and the dynamic nature of the deep web, achieving both wide coverage and high efficiency is a challenging issue. We propose a two-stage framework, namely SmartCrawler, for efficiently harvesting deep-web interfaces. In the first stage, SmartCrawler performs site-based searching for center pages with the help of search engines, avoiding visits to a large number of pages. To achieve more accurate results for a focused crawl, SmartCrawler ranks websites to prioritize highly relevant ones for a given topic. In the second stage, SmartCrawler achieves fast in-site searching by excavating the most relevant links with adaptive link ranking. To eliminate bias toward visiting some highly relevant links in hidden web directories, we design a link tree data structure that achieves wider coverage for a website. Experimental results on a set of representative domains show the agility and accuracy of the proposed framework, which efficiently retrieves deep-web interfaces from large-scale sites and achieves higher harvest rates than other crawlers.
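The abstract names three mechanisms: site ranking in stage one, adaptive link ranking in stage two, and a link tree that spreads in-site visits across directories. The following is a minimal Python sketch of the latter two ideas under stated assumptions; the site_score heuristic, the LinkTree class, its round-robin traversal, and all example URLs are illustrative inventions for this sketch, not the paper's implementation.

```python
from collections import defaultdict
from heapq import heappush, heappop
from urllib.parse import urlparse


def site_score(site_url: str, topic_terms: list[str]) -> int:
    """Stage-one ranking, hypothetical heuristic: count topic terms that
    appear in the site URL. A real ranker would also use page content."""
    text = site_url.lower()
    return sum(term in text for term in topic_terms)


class LinkTree:
    """Stage-two helper (illustrative): bucket in-site links by their
    top-level directory so that one 'hot' directory cannot monopolize
    the crawl frontier, giving wider coverage of the site."""

    def __init__(self) -> None:
        # directory -> min-heap of (-priority, url)
        self._branches: dict[str, list[tuple[float, str]]] = defaultdict(list)

    def add(self, url: str, priority: float) -> None:
        directory = urlparse(url).path.strip("/").split("/")[0] or "<root>"
        # heapq is a min-heap, so negate priority to pop best links first
        heappush(self._branches[directory], (-priority, url))

    def pop_round_robin(self):
        """Yield the best remaining link from each directory in turn."""
        while self._branches:
            for directory in list(self._branches):
                _, url = heappop(self._branches[directory])
                if not self._branches[directory]:
                    del self._branches[directory]
                yield url


if __name__ == "__main__":
    topic = ["book", "library"]
    sites = ["http://example-books.com", "http://cars.example.org"]
    print(sorted(sites, key=lambda s: site_score(s, topic), reverse=True))

    tree = LinkTree()
    tree.add("http://example-books.com/search/advanced", 0.9)
    tree.add("http://example-books.com/search/basic", 0.8)
    tree.add("http://example-books.com/about/team", 0.1)
    # advanced -> team -> basic: the 'about' branch surfaces before the
    # second 'search' link, which is the coverage bias the abstract targets
    print(list(tree.pop_round_robin()))
```

Rotating across directory buckets is what removes the bias the abstract mentions: the second-best link in a hot directory waits until every other directory has yielded its best candidate.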