Deep Web Sources Research Articles

Web-scale data integration involves fully automated efforts that lack knowledge of the exact match between data descriptions. In this paper, we introduce schema matching prediction, an assessment mechanism to support schema matchers in the absence of an exact match. Given attribute pair-wise similarity measures, a predictor predicts the success of a matcher in identifying correct correspondences. We present a comprehensive framework in which predictors can be defined, designed, and evaluated. We formally define schema matching evaluation and schema matching prediction using similarity spaces and discuss a set of four desirable properties of predictors, namely correlation, robustness, tunability, and generalization. We present a method for constructing predictors that supports generalization, and introduce prediction models as a means of tuning prediction toward various quality measures. We define the empirical properties of correlation and robustness and provide concrete measures for their evaluation. We illustrate the usefulness of schema matching prediction by presenting three use cases. First, we propose a method for ranking the relevance of deep Web sources with respect to given user needs. Second, we show how predictors can assist in the design of schema matching systems. Finally, we show how prediction can support dynamic weight setting of matchers in an ensemble, thus improving upon current state-of-the-art weight setting methods. An extensive empirical evaluation shows the value of predictors in these use cases and demonstrates that prediction models can increase the performance of schema matching.
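
The correlation property mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not specify any concrete predictor, so everything here is an assumption for illustration: the predictor `avg_row_max` (mean of the best similarity per attribute), the helper `pearson`, and the toy similarity matrices and precision values, which are invented inputs and not results from the paper.

```python
from math import sqrt
from statistics import mean


def avg_row_max(similarity):
    """Hypothetical predictor: mean of the highest similarity in each row.

    similarity[i][j] is the pair-wise similarity between attribute i of one
    schema and attribute j of the other, as produced by some matcher.
    """
    return mean(max(row) for row in similarity)


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy inputs (made up for illustration): three similarity matrices and the
# precision a matcher is assumed to have reached on each of them.
matrices = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.6, 0.5], [0.4, 0.5]],
    [[0.3, 0.3], [0.2, 0.3]],
]
assumed_precision = [0.9, 0.6, 0.4]

predictions = [avg_row_max(m) for m in matrices]
correlation = pearson(predictions, assumed_precision)
```

A predictor whose values correlate strongly with the actual quality measure across many matching tasks would satisfy the correlation property in this reading; on real data the reference quality would come from known exact matches, not from assumed numbers.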

Deep web search engines face the formidable challenge of retrieving high-quality results from the vast collection of searchable databases. Deep web search is a two-step process of selecting the high-quality sources and ranking the results from the selected sources. Though there are existing methods for both steps, they assess the relevance of the sources and the results using query-result similarity. When applied to the deep web, these methods have two deficiencies. First, they are agnostic to the correctness (trustworthiness) of the results. Second, query-based relevance does not consider the importance of the results and sources. These two considerations are essential for the deep web and for open collections in general. Since a number of deep web sources provide answers to any query, we conjecture that the agreements between these answers are helpful in assessing the importance and the trustworthiness of the sources and the results. For assessing source quality, we compute the agreement between the sources as the agreement of the answers returned. While computing the agreement, we also measure and compensate for possible collusion between the sources. This adjusted agreement is modeled as a graph with sources at the vertices. On this agreement graph, a quality score of a source, which we call SourceRank, is calculated as the stationary visit probability of a random walk. For ranking results, we analyze the second-order agreement between the results. Further extending SourceRank to multidomain search, we propose a source ranking sensitive to the query domains. Multiple domain-specific rankings of a source are computed, and these ranks are combined for the final ranking. We perform extensive evaluations on online sources and on hundreds of Google Base sources spanning multiple domains. The proposed result and source rankings are implemented in the deep web search engine Factal. We demonstrate that the agreement analysis tracks source corruption. Further, our relevance evaluations show that our methods improve precision significantly over Google Base and the other baseline methods. The result ranking and the domain-specific source ranking are evaluated separately.
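
The idea of scoring a source by the stationary visit probability of a random walk on an agreement graph can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `stationary_scores`, the `damping` value of 0.85 (a PageRank-style smoothing assumed here to guarantee convergence), the edge direction, and the toy agreement weights are all assumptions, and collusion compensation is not modeled.

```python
def stationary_scores(agreement, damping=0.85, tol=1e-10, max_iter=1000):
    """Stationary visit probabilities of a random walk on a weighted graph.

    agreement[i][j] is a hypothetical non-negative weight on the edge from
    source i to source j, read here as how strongly source i's answers
    agree with (endorse) those of source j.
    """
    n = len(agreement)

    # Row-normalize the weights into transition probabilities. A source with
    # no outgoing agreement jumps to every source uniformly.
    transition = []
    for row in agreement:
        total = sum(row)
        if total > 0:
            transition.append([w / total for w in row])
        else:
            transition.append([1.0 / n] * n)

    # Power iteration with uniform smoothing (assumed, PageRank-style).
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = [
            (1.0 - damping) / n
            + damping * sum(scores[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores


# Toy agreement graph (made-up weights): sources 0 and 1 agree strongly with
# each other, while source 2 receives little agreement from either.
agreement = [
    [0.0, 0.9, 0.1],
    [0.8, 0.0, 0.1],
    [0.5, 0.5, 0.0],
]
scores = stationary_scores(agreement)
ranking = sorted(range(len(scores)), key=lambda s: scores[s], reverse=True)
```

In this toy graph the source that attracts little agreement ends up last in `ranking`, which mirrors the intuition that answers corroborated by many other sources indicate a more important and more trustworthy source.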

Articles published on Deep Web Sources

DWSpyder: a new schema extraction method for a deep web integration system

Survey of Techniques for Deep Web Source Selection and Surfacing the Hidden Web Content

Deep Web Integration: the Tip of the Iceberg

Query Recommendation in Hidden Web Search Engine using Web Log Mining Techniques

Schema matching prediction with applications to data source discovery and dynamic ensembling

Assessing relevance and trust of the deep web sources and results based on inter-source agreement

The Ranking of Deep Web Sources Based on Data Quality

Stratified sampling for data mining on the deep web

Data Integration for World Wide Web

An enhanced swarm intelligence clustering-based RBFNN classifier and its application in deep Web sources classification

Utility Maximization Model for Deep Web Source Selection and Integration

Extracting result schema based on query instances in the Deep Web

Service Class Driven Dynamic Data Source Discovery with DynaBot

QA-Pagelet: data preparation techniques for large-scale data analysis of the deep Web

Structured databases on the web
