Crawling ranked deep Web data sources

Yan Wang, Jianguo Lu, Jessica Chen, Yaxin Li

doi:10.1007/s11280-016-0410-4

Abstract

In the era of big data, the vast majority of data do not come from the surface Web, the part of the Web that is interconnected by hyperlinks and indexed by most general-purpose search engines. Instead, the trove of valuable data often resides in the deep Web, the part of the Web that is hidden behind query interfaces. Since numerous applications, such as data integration and vertical portals, require deep Web data, various crawling methods have been developed for exhaustively harvesting a deep Web data source at minimal (or near-minimal) cost. Most existing crawling methods assume that all the documents matched by a query are returned. In practice, data sources often return only the top k matches. This makes exhaustive data harvesting difficult: highly ranked documents are returned multiple times, while low-ranked documents have little chance of being returned. In this paper, we decompose this problem into two orthogonal sub-problems, the query bias problem and the ranking bias problem, and propose a document-frequency-based crawling method to overcome the ranking bias problem. The rationale of our method is to use queries whose document frequencies fall within a specified range, which avoids the combined effect of search ranking and the return limit and significantly reduces the difficulty of crawling a ranked data source. The method is extensively tested on a variety of datasets and compared with two existing methods. The experimental results demonstrate that our method outperforms the two algorithms by 58% and 90% on average, respectively.
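The idea of restricting queries to a document-frequency range can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the authors' algorithm: the function names (`document_frequencies`, `select_queries`, `top_k_search`, `crawl`), the five-document corpus, and the stand-in ranking (lowest document id first) are all hypothetical, and document frequencies are computed from the full corpus here, whereas a real crawler would have to estimate them.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for text in docs:
        # Use a set so a term is counted once per document.
        df.update(set(text.lower().split()))
    return df


def select_queries(df, low, high):
    """Keep terms whose document frequency lies in [low, high] (hypothetical bounds)."""
    return sorted(term for term, count in df.items() if low <= count <= high)


def top_k_search(docs, term, k):
    """Toy ranked source: return at most k matching document ids.

    Ranking is simulated by document id (lowest first); this is an
    assumption made only for illustration.
    """
    matches = [i for i, text in enumerate(docs) if term in text.lower().split()]
    return matches[:k]


def crawl(docs, queries, k):
    """Issue every query and collect the ids of all returned documents."""
    harvested = set()
    for query in queries:
        harvested.update(top_k_search(docs, query, k))
    return harvested


docs = [
    "deep web crawling",
    "deep web query",
    "deep web ranking",
    "deep data source",
    "surface web index",
]
k = 2  # return limit of the toy source
df = document_frequencies(docs)

# High-frequency terms overflow the return limit and keep returning
# the same top-ranked documents.
print(crawl(docs, ["deep", "web"], k))             # {0, 1}

# Terms whose document frequency does not exceed k are returned in full.
print(crawl(docs, select_queries(df, 1, k), k))    # {0, 1, 2, 3, 4}
```

In this sketch the upper bound is set to the return limit k, so no selected query overflows and the ranking has no effect on what is returned. How the range should be chosen in practice is a design decision that the abstract does not specify.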

Journal: World Wide Web
Publication Date: Sep 3, 2016
Citations: 29

Similar Papers

Crawling Ranked Deep Web Data Sources
Yan Wang ... Nannan Pi
01 Jan 2015

Answering Cross-Source Keyword Queries over Deep Web Data Sources
Fan Wang ... Gagan Agrawal
01 Jan 2010

Answering complex structured queries over the deep web
Fan Wang ... Gagan Agrawal
01 Jan 2010

Extracting Output Metadata from Scientific Deep Web Data Sources
Fan Wang ... Gagan Agrawal
01 Dec 2009
