Abstract

Schema matching has historically been a sub-area of data integration, responsible for matching relational or semi-structured schemas to enable subsequent data integration steps. In a standard schema matching scenario with two schemas, a semi-supervised matching algorithm generates pairwise table and attribute matches. A correct mapping between two schemas enables many data integration scenarios, such as schema integration, data translation, schema evolution, mediated/global schemas, and reverse engineering [18], [20], [23], [29], [30]. For Web-scale datasets with millions of tables from hundreds of thousands of sources, such as WEBTABLES [11], schema matching in its classical form becomes computationally infeasible due to its quadratic complexity in the number of schemas. We observe that such brute-force matching is not only infeasible at this scale, but also unnecessary. Instead, a scalable solution matches only the semantically relevant tables, which are far less numerous. WebLens, a scalable data integration system, first trains Deep Learning models to find and match semantically similar tables, then derives mediated schemas for these subsets to enable uniform access to all relevant data. In this paper, we focus on a high-level description of the entire process and give an example of query processing. For all experiments in the paper, we use a large-scale structured dataset containing more than 15 million relational Web tables in English from more than 248 thousand Web sources.
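To make the complexity argument above concrete, the following is a minimal, hypothetical Python sketch of the general idea: embed each schema, group semantically similar tables, and run pairwise matching only within groups, so the candidate set shrinks from one quadratic term over all tables to a sum of much smaller quadratic terms. The TF-IDF embedding and KMeans clustering here are stand-ins chosen for illustration; they are not the paper's method, which trains Deep Learning models for this step, and none of the names below come from the paper.

# Toy illustration (not the WebLens implementation): restrict pairwise
# schema matching to clusters of semantically similar tables instead of
# comparing all C(n, 2) table pairs.

from itertools import combinations

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical tables, each represented by its list of attribute names.
tables = {
    "t1": ["city", "population", "country"],
    "t2": ["town", "inhabitants", "nation"],
    "t3": ["player", "team", "goals"],
    "t4": ["athlete", "club", "score"],
}

# Embed each schema as a TF-IDF vector over its attribute names
# (a crude stand-in for a learned table representation).
names = list(tables)
docs = [" ".join(tables[t]) for t in names]
vectors = TfidfVectorizer().fit_transform(docs)

# Cluster the embeddings so that only semantically related schemas
# end up in the same group.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)

# Pairwise matching now runs only inside each cluster.
for label, members in clusters.items():
    for a, b in combinations(members, 2):
        print(f"match candidate: {a} <-> {b}")

With n tables split into k clusters of roughly n/k tables each, the number of candidate pairs drops from about n^2/2 to about n^2/(2k), which is what makes matching only the semantically relevant tables tractable at Web scale.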
