Stratified sampling for data mining on the deep web

Tantan Liu, Gagan Agrawal, Fan Wang

doi:10.1007/s11704-012-2859-3

Abstract

In recent years, the deep web has become extremely popular. As with any other data source, mining data on the deep web can produce important insights or summaries of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly; mining must therefore be performed on samples of the datasets. The samples, in turn, can only be obtained by querying deep web databases with specific inputs. In this paper, we target two related data mining problems, association mining and differential rule mining. These are proposed to extract high-level summaries of the differences in data provided by different deep web data sources in the same domain. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which recursively processes the query space of a deep web data source and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experimental results show that our algorithms effectively and consistently reduce sampling costs compared with a stratified sampling method that considers only estimation error. In addition, compared with simple random sampling, our algorithm has higher sampling accuracy and lower sampling costs.
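To make the idea of allocating samples under both estimation error and sampling cost concrete, the sketch below uses the textbook cost-aware optimal allocation for stratified sampling, in which the sample size for a stratum is proportional to N_h * S_h / sqrt(c_h). This is not the allocation method developed in the paper; the function names, the per-stratum standard deviations, and the per-sample query costs are all assumptions chosen for illustration.

```python
import math


def cost_aware_allocation(sizes, std_devs, costs, total_samples):
    """Textbook allocation (an assumption, not the paper's method):
    n_h is proportional to N_h * S_h / sqrt(c_h), where N_h is the stratum
    size, S_h its standard deviation, and c_h the cost of one sample."""
    weights = [n * s / math.sqrt(c) for n, s, c in zip(sizes, std_devs, costs)]
    total = sum(weights)
    return [total_samples * w / total for w in weights]


def stratified_mean_variance(sizes, std_devs, allocation):
    """Variance of the stratified mean estimator,
    ignoring the finite-population correction."""
    population = sum(sizes)
    return sum(
        (n / population) ** 2 * s ** 2 / a
        for n, s, a in zip(sizes, std_devs, allocation)
    )


def sampling_cost(costs, allocation):
    """Total cost of drawing the allocated samples."""
    return sum(c * a for c, a in zip(costs, allocation))


# Hypothetical strata: equal size and spread, but the second stratum
# is four times as expensive to query.
sizes = [1000, 1000]
std_devs = [1.0, 1.0]
costs = [1.0, 4.0]

allocation = cost_aware_allocation(sizes, std_devs, costs, total_samples=90)
print(allocation)
print(stratified_mean_variance(sizes, std_devs, allocation))
print(sampling_cost(costs, allocation))
```

In this hypothetical setup the cheaper stratum receives twice as many samples as the expensive one, which illustrates the general trade-off: an allocation that looks only at estimation error would split the samples evenly and pay more in query cost.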

Journal: Frontiers of Computer Science
Publication Date: Mar 31, 2012
Citations: 15

Similar Papers

Stratified Sampling for Data Mining on the Deep Web
Tantan Liu ... Gagan Agrawal
01 Dec 2010

Stratification Based Hierarchical Clustering Over a Deep Web Data Source
Tantan Liu ... Gagan Agrawal
26 Apr 2012

Quality-based data source selection for web-scale Deep Web data integration
Xue-Feng Xian ... Zhi-Ming Cui
01 Jul 2009

Active learning based frequent itemset mining over the deep web
Tantan Liu ... Gagan Agrawal
01 Apr 2011
