Data Integration over Distributed and Heterogeneous Data Endpoints

Samur Araújo

doi:10.4233/uuid:8214a00d-3702-40d4-b989-441c9364b43d

Abstract

Data integration is a broad area encompassing techniques to merge data between data sources. Although there are plenty of efficient and effective methods focusing on data integration over homogeneous data, where instances share the same schema and range of values, their applications over heterogeneous data are less clear. This thesis considers data integration within the environment of the Semantic Web. More particularly, we propose a novel architecture for instance matching that takes into account the particularities of this heterogeneous and distributed setting. Instead of assuming that instances share the same schema, the proposed method operates even when there is no overlap between schemas, apart from a key label that matching instances must share. Moreover, we have considered the distributed nature of the Semantic Web to propose a new architecture for general data integration, which operates on-the-fly and in a pay-as-you-go fashion. We show that our view and the view of the traditional data integration school each only partially address the problem, but together complement each other. We have observed that this unified view gives a better insight into their relative importance and how data integration methods can benefit from their combination. The results achieved in this work are particularly interesting for the Semantic Web and Data Integration communities.

Full Text