Crawling Ranked Deep Web Data Sources

Yan Wang, Jianguo Lu, Nannan Pi, Yaxin Li

doi:10.1007/978-3-319-26190-4_26

Abstract

In the era of big data, the vast majority of data does not come from the surface web, the web that is interconnected by hyperlinks and indexed by most general-purpose search engines. Instead, the trove of valuable data often resides in the deep web, the web that is hidden behind query interfaces. Since the data in the deep web are often of high value, there has been a line of research on crawling deep web data sources over the past decade. However, most existing crawling methods assume that all the matched documents are returned. In practice, many data sources rank the matched documents and return only the top k matches. When conventional methods are applied to such ranked data sources, popular queries that match more than k documents cause large redundancy. This paper proposes a document frequency (df) based algorithm that exploits the queries whose document frequencies are within a specified range. The algorithm is extensively tested on a variety of datasets and compared with two existing algorithms. We demonstrate that our method outperforms the two algorithms by 58% and 90% on average, respectively.
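The idea of restricting queries to a document frequency range can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes the document frequencies are already known (in practice they would have to be estimated, for example from a sample), and the names `select_queries_by_df`, `crawl_top_k`, and the toy index are hypothetical.

```python
from typing import Dict, List, Set

def select_queries_by_df(df: Dict[str, int], low: int, high: int) -> List[str]:
    """Return terms with low <= df <= high, highest df first (ties by name)."""
    candidates = [t for t, f in df.items() if low <= f <= high]
    return sorted(candidates, key=lambda t: (-df[t], t))

def crawl_top_k(index: Dict[str, List[int]], queries: List[str], k: int) -> Set[int]:
    """Simulate a ranked source: each query returns only its top k matches."""
    harvested: Set[int] = set()
    for q in queries:
        harvested.update(index.get(q, [])[:k])
    return harvested

# Toy ranked source (assumed data): term -> matching document ids in rank order.
index = {
    "data": [1, 2, 3, 4, 5, 6],   # df = 6, exceeds k, so results are truncated
    "web": [1, 2, 3],
    "crawl": [4, 5],
    "rank": [6],
}
k = 3
df = {term: len(docs) for term, docs in index.items()}

# Keep only queries that the source can answer in full (df no greater than k).
queries = select_queries_by_df(df, low=1, high=k)
covered = crawl_top_k(index, queries, k)
popular_only = crawl_top_k(index, ["data"], k)
```

In this toy setting the popular query "data" matches six documents but yields only three, whereas the three queries whose df is at most k together retrieve all six documents.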

Full Text