Abstract

Schema matching has historically been a sub-area of data integration, responsible for matching relational or semi-structured schemas to enable subsequent data integration steps. In a standard schema matching scenario with two schemas, a semi-supervised matching algorithm generates pairwise table and attribute matches. A correct mapping between two schemas enables many data integration scenarios, such as schema integration, data translation, schema evolution, mediated/global schemas, and reverse engineering [18], [20], [23], [29], [30]. For Web-scale datasets with millions of tables from hundreds of thousands of sources, such as WEBTABLES [11], schema matching in its classical form becomes computationally infeasible due to its quadratic complexity in the number of schemas. We observe that such brute-force matching is not only infeasible at this scale, but also unnecessary. Instead, a scalable solution matches only the semantically relevant tables, which are far less numerous. WebLens, a scalable data integration system, first trains Deep Learning models to find and match semantically similar tables, then derives mediated schemas for these subsets to enable uniform access to all relevant data. In this paper, we focus on a high-level description of the entire process and give an example of query processing. For all experiments in the paper, we use a large-scale structured dataset containing more than 15 million relational Web tables in English from more than 248 thousand Web sources.
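To make the complexity argument above concrete, the following is a minimal, hypothetical Python sketch of the general idea: embed each schema, group semantically similar tables, and run pairwise matching only within groups, so the candidate set shrinks from one quadratic term over all tables to a sum of much smaller quadratic terms. The TF-IDF embedding and KMeans clustering here are stand-ins chosen for illustration; they are not the paper's method, which trains Deep Learning models for this step, and none of the names below come from the paper.

# Toy illustration (not the WebLens implementation): restrict pairwise
# schema matching to clusters of semantically similar tables instead of
# comparing all C(n, 2) table pairs.

from itertools import combinations

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical tables, each represented by its list of attribute names.
tables = {
    "t1": ["city", "population", "country"],
    "t2": ["town", "inhabitants", "nation"],
    "t3": ["player", "team", "goals"],
    "t4": ["athlete", "club", "score"],
}

# Embed each schema as a TF-IDF vector over its attribute names
# (a crude stand-in for a learned table representation).
names = list(tables)
docs = [" ".join(tables[t]) for t in names]
vectors = TfidfVectorizer().fit_transform(docs)

# Cluster the embeddings so that only semantically related schemas
# end up in the same group.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)

# Pairwise matching now runs only inside each cluster.
for label, members in clusters.items():
    for a, b in combinations(members, 2):
        print(f"match candidate: {a} <-> {b}")

With n tables split into k clusters of roughly n/k tables each, the number of candidate pairs drops from about n^2/2 to about n^2/(2k), which is what makes matching only the semantically relevant tables tractable at Web scale.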
