Estimating data integration and cleaning effort

Sebastian Kruse, Paolo Papotti, Felix Naumann

doi:10.5441/002/edbt.2015.07

Abstract

Data cleaning and data integration have been the topic of intensive research for at least the past thirty years, resulting in a multitude of specialized methods and integrated tool suites. All of them require at least some, and in most cases significant, human input in their configuration, during processing, and for evaluation. For managers (and for developers and scientists) it would therefore be of great value to be able to estimate the effort of cleaning and integrating given data sets and to know the pitfalls of such an integration project in advance. Such an estimate helps in deciding whether to undertake an integration project using cost/benefit analysis, in budgeting a team with funds and manpower, and in monitoring its progress. Further, knowledge of how well a data source fits into a given data ecosystem improves source selection. We present an extensible framework for automatic effort estimation for mapping and cleaning activities in data integration projects with multiple sources. It comprises a set of measures and methods for estimating integration complexity and ultimately effort, taking into account heterogeneities of both schemas and instances and covering both integration and cleaning operations. Experiments on two real-world scenarios show that our proposal is two to four times more accurate than a current approach in estimating the duration of an integration process, and that it provides a meaningful breakdown of the integration problems as well as the required integration activities.
