Abstract

Schema matching is the process of establishing correspondences between the attributes of database schemas for data integration purposes. Although several automatic schema matching tools have been developed, their results are often incomplete or erroneous. To obtain a correct set of correspondences, human effort is usually required to validate the generated correspondences. This validation process is often costly, as it is performed by highly skilled experts. Our paper analyzes how to leverage crowdsourcing techniques so that a large group of non-experts can validate the generated correspondences. In our work we assume that one needs to establish attribute correspondences not only between two schemas but in a network of schemas. We also assume that the matching is realized in a pairwise fashion, in the presence of consistency expectations about the network of attribute correspondences. We demonstrate that formulating these expectations in the form of integrity constraints can improve the process of reconciliation. Because user input in crowdsourcing is unreliable, specific aggregation techniques are needed to obtain good-quality results. We demonstrate that consistency constraints not only improve the quality of aggregated answers, but also enable us to estimate the quality of individual workers' answers more reliably and to detect spammers. Moreover, these constraints make it possible to minimize the human effort needed for the same expected quality of results.
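To make the role of constraints in answer aggregation concrete, the following is a minimal sketch, not the paper's method. It assumes a one-to-one constraint (an attribute matches at most one attribute of any other schema) as an example of an integrity constraint, and it resolves conflicts greedily by worker support. The schema names, attribute names, votes, and all function names are hypothetical.

```python
# Hypothetical crowd votes: each correspondence links two attributes,
# written as (schema, attribute), and maps to the workers' yes/no answers.
# Correspondences are assumed to be stored in a consistent orientation.
votes = {
    (("S1", "date"), ("S2", "releaseDate")): [True, True, True],
    (("S1", "date"), ("S2", "screeningDate")): [True, True, False],
    (("S1", "title"), ("S2", "name")): [True, False, True],
}

def support(ballots):
    """Fraction of workers who approved a correspondence."""
    return sum(ballots) / len(ballots)

def majority(votes):
    """Plain majority voting: accept when more than half of the workers approve."""
    return {c for c, ballots in votes.items() if support(ballots) > 0.5}

def conflicts(c1, c2):
    """Assumed one-to-one constraint: two correspondences conflict when they
    map the same attribute to two different attributes of the same schema."""
    (a1, b1), (a2, b2) = c1, c2
    same_source = a1 == a2 and b1[0] == b2[0] and b1 != b2
    same_target = b1 == b2 and a1[0] == a2[0] and a1 != a2
    return same_source or same_target

def constrained(votes):
    """Greedy repair: visit majority-accepted correspondences by decreasing
    support and keep each one only if it conflicts with nothing kept so far."""
    accepted = set()
    ranked = sorted(majority(votes), key=lambda c: support(votes[c]), reverse=True)
    for c in ranked:
        if not any(conflicts(c, other) for other in accepted):
            accepted.add(c)
    return accepted
```

With these toy votes, plain majority voting accepts all three correspondences, including two competing matches for `S1.date`. The constrained variant keeps only the better-supported of the two, which illustrates how a constraint can filter answers that majority voting alone would let through.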

Highlights

  • More and more online services enable users to upload and share structured data, including Google Fusion Tables [1], Freebase [2], and Factual [3]

  • In the following we introduce the schema matching network model [7] that we will use in our work

  • We review salient work in the schema matching and crowdsourcing areas that is related to our research


Summary

Introduction

More and more online services enable users to upload and share structured data, including Google Fusion Tables [1], Freebase [2], and Factual [3]. These services primarily offer easy visualization of uploaded data as well as tools to embed the visualizations in blogs or Web pages. An example is the often-quoted coffee consumption data found in Google Fusion Tables, which is distributed among different tables, each representing a specific region [1]. Extracting information over all regions requires means to query and aggregate across multiple tables, thereby raising the need to interconnect schemas to achieve an integrated view of the data. The number of publicly available datasets grows rapidly, making the integration more and more challenging.
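As a rough illustration of why cross-table queries depend on attribute correspondences, the sketch below aggregates over two regional tables whose schemas name the same concepts differently. The tables, attribute names, values, and the `integrate` helper are all made up for illustration and are not taken from the actual Fusion Tables data.

```python
# Two hypothetical regional tables with differently named attributes.
tables = {
    "europe": [
        {"country": "A", "kg_per_capita": 2.0},
        {"country": "B", "kg_per_capita": 3.0},
    ],
    "asia": [
        {"nation": "C", "consumption": 1.0},
    ],
}

# Attribute correspondences, expressed here as a mapping from each table's
# local attribute names to a shared vocabulary (an assumed representation).
correspondences = {
    "europe": {"country": "country", "kg_per_capita": "consumption"},
    "asia": {"nation": "country", "consumption": "consumption"},
}

def integrate(tables, correspondences):
    """Rename every row's attributes to the shared vocabulary and
    return one combined list of rows tagged with their region."""
    rows = []
    for region, table in tables.items():
        mapping = correspondences[region]
        for row in table:
            unified = {mapping[attr]: value for attr, value in row.items()}
            unified["region"] = region
            rows.append(unified)
    return rows

# A query over all regions becomes a single pass over the unified rows.
unified_rows = integrate(tables, correspondences)
total_consumption = sum(row["consumption"] for row in unified_rows)
```

Without the `correspondences` mapping, the sum over `consumption` would silently skip or fail on the table that stores the same quantity as `kg_per_capita`, which is the integration problem the paragraph above describes.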

