Abstract

In the big data era, data integration is becoming increasingly important. It is usually handled by data flow processes that extract, transform, and clean data from several sources and populate the data integration system (DIS). Designing data flows raises several challenges. In this article, we address data quality issues, namely (1) specifying a set of quality rules, (2) enforcing them on the data flow pipeline to detect violations, and (3) producing accurate repairs for the detected violations. We propose QDflows, a system for designing quality-aware data flows that takes as input: (1) a high-quality knowledge base (KB) as the global schema of integration, (2) a set of data sources and a set of validated users’ requirements, (3) a set of defined mappings between data sources and the KB, and (4) a set of quality rules specified by users. QDflows uses an ontology to design the DIS schema. It allows the DIS ontology to be defined as a module of the knowledge base, based on validated users’ requirements. The DIS ontology model is then extended with multiple types of quality rules specified by users. QDflows extracts and transforms data from sources to populate the DIS. It detects violations of the quality rules enforced on the data flows, constructs repair patterns, searches for horizontal and vertical matches in the knowledge base, and either performs an automatic repair when possible or generates candidate repairs. It involves users interactively to validate the repair process before loading the clean data into the DIS. Using real-life and synthetic datasets together with the DBpedia and Yago knowledge bases, we experimentally evaluate the generality, effectiveness, and efficiency of QDflows. We also showcase an interactive tool implementing our system.

