Abstract

Centralised data management systems (e.g., data lakes) support queries over heterogeneous data from multiple sources. However, results drawn from multiple sources commonly contain between-source conflicts, which make them unreliable and confusing and degrade the usability of such systems. Resolving between-source conflicts is therefore one of the most important problems for centralised data management systems. Many batch data fusion methods have been proposed to solve it, but they require traversing all the data in the system, which causes scalability and flexibility issues. To address these issues, this paper explores the problem of on-demand fusion queries, in which between-source conflicts are resolved using only the query-related data. We propose an efficient on-demand fusion query framework, FusionQuery, which consists of a query stage and a fusion stage. In the query stage, we frame the heterogeneous data query problem as a knowledge graph matching problem and present a line graph-based method to accelerate it. In the fusion stage, we develop an Expectation Maximization-style algorithm that iteratively updates data veracity and source trustworthiness. Furthermore, we design an incremental method for estimating source trustworthiness to address the lack of sufficient observations. Extensive experiments on two real-world datasets demonstrate that FusionQuery outperforms state-of-the-art data fusion methods in terms of both effectiveness and efficiency.
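
To make the fusion stage more concrete, the following is a minimal sketch of a generic alternating update between value veracity and source trustworthiness, in the spirit of an Expectation Maximization-style loop. It is not the FusionQuery algorithm itself: the function name `fuse_claims`, the `(source, entity, value)` claim format, the trust-weighted voting rule, and the `prior_trust` default are all assumptions chosen for illustration.

```python
from collections import defaultdict


def fuse_claims(claims, iterations=20, prior_trust=0.8):
    """Alternate between veracity and trust estimates (illustrative only).

    claims: list of (source, entity, value) triples returned by a query.
    Returns (veracity, trust), where veracity maps (entity, value) to a
    score in [0, 1] and trust maps each source to a score in [0, 1].
    """
    sources = {s for s, _, _ in claims}
    trust = {s: prior_trust for s in sources}  # assumed uniform prior

    # Group the conflicting values claimed for each entity.
    by_entity = defaultdict(list)
    for source, entity, value in claims:
        by_entity[entity].append((source, value))

    veracity = {}
    for _ in range(iterations):
        # E-like step: a value's veracity is its share of the
        # trust-weighted votes cast for that entity.
        for entity, votes in by_entity.items():
            score = defaultdict(float)
            for source, value in votes:
                score[value] += trust[source]
            total = sum(score.values())
            for value, s in score.items():
                veracity[(entity, value)] = s / total

        # M-like step: a source's trust is the mean veracity of its claims.
        sums = defaultdict(float)
        counts = defaultdict(int)
        for source, entity, value in claims:
            sums[source] += veracity[(entity, value)]
            counts[source] += 1
        trust = {s: sums[s] / counts[s] for s in sources}

    return veracity, trust


# Hypothetical query result: three sources disagree on one attribute.
claims = [
    ("A", "movie:year", 1994),
    ("B", "movie:year", 1994),
    ("C", "movie:year", 1995),
    ("A", "movie:director", "X"),
    ("C", "movie:director", "X"),
]
veracity, trust = fuse_claims(claims)
```

Because only the claims returned by the query are passed in, the loop touches query-related data only, which is the on-demand idea described above. A real system would also need the incremental trust estimate the abstract mentions, since a single query may yield too few observations per source.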

