Abstract

Data quality concerns arise when one wants to correct anomalies in a single data source (e.g., duplicate elimination in a file), or when one wants to integrate data coming from multiple sources into a single new data source (e.g., data warehouse construction). Three data quality problems are typically encountered: (1) the absence of universal keys across different databases, known as the object identity problem; (2) the existence of keyboard errors in the data; and (3) the presence of inconsistencies in data coming from multiple sources. Dealing with these problems is globally called the data cleaning process.

We propose a framework that models a data cleaning application as a directed graph of data transformations. Transformations are divided into four distinct classes: mapping, matching, clustering and merging; each of them is implemented by a macro-operator. Moreover, we propose an SQL extension for specifying each of the macro-operators. One important feature of the framework is the ability to include human interaction explicitly in the process. Finally, we study performance optimizations tailored for data cleaning applications: mixed evaluation, neighborhood hash join, decision push-down and short-circuited computation.
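To make the pipeline model concrete, the following is a minimal illustrative sketch (not the paper's macro-operators or SQL syntax; all names are hypothetical) of a duplicate-elimination application composed of the four transformation classes applied in sequence:

```python
# Hypothetical sketch of a data cleaning pipeline built from the four
# transformation classes: mapping, matching, clustering, merging.
# None of these function names or signatures come from the paper.

def mapping(records):
    """Standardize each record to remove format-level anomalies."""
    return [{"name": r["name"].strip().lower()} for r in records]

def matching(records):
    """Return index pairs of records that appear to denote the same object."""
    return [(i, j)
            for i in range(len(records))
            for j in range(i + 1, len(records))
            if records[i]["name"] == records[j]["name"]]

def clustering(n, pairs):
    """Group matched records into clusters (connected components, union-find)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in pairs:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def merging(records, clusters):
    """Collapse each cluster into a single representative record."""
    return [records[cluster[0]] for cluster in clusters]

raw = [{"name": " Alice "}, {"name": "alice"}, {"name": "Bob"}]
std = mapping(raw)
dedup = merging(std, clustering(len(std), matching(std)))
# dedup now holds one record per distinct entity.
```

In the framework described above, each stage would be a node in the directed transformation graph, and points such as ambiguous matches are where explicit human interaction could be inserted.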
