Incorporating Data Context to Cost-Effectively Automate End-to-End Data Wrangling

Martin Koehler, Edward Abel, Alvaro A.A Fernandes, John Keane, Nikolaos Konstantinou, Lacramioara Mazilu, Leonid Libkin, Norman W Paton, Alex Bogatu, Cristina Civili

doi:10.1109/tbdata.2019.2907588

Open Access

https://doi.org/10.1109/tbdata.2019.2907588

Abstract

The process of preparing potentially large and complex data sets for further analysis or manual examination is often called data wrangling. In classical warehousing environments, the steps in such a process are carried out using Extract-Transform-Load platforms, with significant manual involvement in specifying, configuring or tuning many of them. In typical big data applications, we need to ensure that all wrangling steps, including web extraction, selection, integration and cleaning, benefit from automation wherever possible. Towards this goal, in this paper we: (i) introduce a notion of data context, which associates portions of a target schema with extensional data of types that are commonly available; (ii) define a scalable methodology to bootstrap an end-to-end data wrangling process based on data profiling; (iii) describe how data context is used to inform automation in several steps within wrangling, specifically matching, value format transformation, data repair, and mapping generation and selection, to optimise the accuracy, consistency and relevance of the result; and (iv) evaluate the approach with real estate data and financial data, showing substantial improvements in the results of automated wrangling.

Highlights

In the past decade, managing, processing and analysing data has changed radically towards establishing data-driven organisations
We study the problem of cost-effectively automating an end-to-end data wrangling process, that is, to integrate, clean, select from a large set of input sources and create a data product that is suitable for downstream analysis by optimising its quality
We describe how automation can be informed by the data context, which consists of data sources D that can be aligned with the target schema, thereby providing partial, potentially erroneous and contradicting instance-based evidence about the target
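
As a rough illustration of how instance-based evidence from a data context might inform matching, the sketch below scores each source column against data context values that are already aligned with target attributes. It is not the matcher used in the paper: the containment measure, the threshold, and the names `containment` and `match_columns` are assumptions made for this example only.

```python
from typing import Dict, List, Set, Tuple


def containment(source_values: Set[str], context_values: Set[str]) -> float:
    """Fraction of distinct source values that also occur in the data context."""
    if not source_values:
        return 0.0
    return len(source_values & context_values) / len(source_values)


def match_columns(
    source: Dict[str, List[str]],
    data_context: Dict[str, List[str]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> Dict[str, Tuple[str, float]]:
    """Map each source column to the target attribute whose data context
    values overlap most with it, keeping only matches above the threshold."""
    # Normalise the data context once: target attribute -> set of values.
    context_sets = {
        attr: {v.strip().lower() for v in values}
        for attr, values in data_context.items()
    }
    matches: Dict[str, Tuple[str, float]] = {}
    for column, values in source.items():
        normalised = {v.strip().lower() for v in values}
        best_attr, best_score = None, 0.0
        for attr, ctx in context_sets.items():
            score = containment(normalised, ctx)
            if score > best_score:
                best_attr, best_score = attr, score
        if best_attr is not None and best_score >= threshold:
            matches[column] = (best_attr, best_score)
    return matches


# Toy source table and data context; "XX" stands in for an erroneous value.
source = {"pc": ["M1 1AA", "M13 9PL", "XX"], "town": ["Manchester", "Leeds"]}
data_context = {
    "postcode": ["M1 1AA", "M13 9PL", "LS1 4AP"],
    "city": ["manchester", "leeds", "york"],
}
matches = match_columns(source, data_context)
```

Because the evidence may be partial or erroneous, the sketch accepts a match on partial overlap rather than requiring every source value to appear in the data context.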

Summary

INTRODUCTION

In the past decade, managing, processing and analysing data has changed radically towards establishing data-driven organisations. We extend and refine these approaches to use target instances in automation, to provide a comprehensive, end-to-end approach incorporating instance-based evidence from multiple sources in the data context that may be partial or spurious. Using this notion, we define a domain-independent methodology to apply data context on a potentially large set of steps, and specific methods to inform a concrete set of individual steps, with the objective of improving the quality of the wrangling result. The contributions include a description of how data context can inform multiple steps within an end-to-end wrangling process, namely matching, mapping validation, value format transformation, rule-based data cleaning and mapping selection, to generate and validate candidates with the objective of improving the accuracy, consistency, and relevance of the wrangling result. The paper concludes with related work, conclusions and future work in Sections 5 and 6.
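
To make the idea of selecting among candidate mappings by accuracy, consistency and relevance more concrete, here is a minimal sketch that ranks candidates by a weighted sum of per-criterion estimates. The weighted-sum rule, the `Candidate` class, and the example scores are assumptions for illustration; the summary above does not state how the paper combines the criteria.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Candidate:
    """A candidate mapping with quality estimates in [0, 1] per criterion."""
    name: str
    scores: Dict[str, float]


def weighted_score(candidate: Candidate, weights: Dict[str, float]) -> float:
    """Weighted average of the criterion scores; missing criteria count as 0."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * candidate.scores.get(c, 0.0) for c, w in weights.items()) / total


def select_mapping(candidates: List[Candidate], weights: Dict[str, float]) -> Candidate:
    """Return the candidate with the highest weighted score."""
    if not candidates:
        raise ValueError("no candidate mappings to choose from")
    return max(candidates, key=lambda c: weighted_score(c, weights))


# Hypothetical estimates, e.g. derived by comparing results with the data context.
candidates = [
    Candidate("m1", {"accuracy": 0.9, "consistency": 0.6, "relevance": 0.5}),
    Candidate("m2", {"accuracy": 0.7, "consistency": 0.8, "relevance": 0.8}),
]
equal = {"accuracy": 1.0, "consistency": 1.0, "relevance": 1.0}
accuracy_first = {"accuracy": 3.0, "consistency": 0.5, "relevance": 0.5}

balanced_choice = select_mapping(candidates, equal)
accuracy_choice = select_mapping(candidates, accuracy_first)
```

With equal weights the sketch prefers `m2`, while weighting accuracy more heavily switches the choice to `m1`, which shows how the preferred mapping depends on the relative importance given to each criterion.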

PROBLEM STATEMENT

Data Context Types

DATA CONTEXT INFORMED WRANGLING

Source and Data Context Profiling

Schema Matching

Automating Schema Matching

Context Informed Automation

Value Format Transformation

Automating Value Format Transformations

5: Rd ← generate test transform rules(E)

Rule-Based Data Repair

Automating Rule Based Data Repair

10: CFDbest ← CFDd′

Automating Mapping Generation

Multi-Criteria Mapping Selection

Automating Mapping Selection

EXPERIMENTAL EVALUATION

Application Domain and Data

Measuring Wrangling Quality

Data Context

Effect of Data Context on the Wrangling Result

Effect of Multiple Data Context Types

Effect of Different Data Context Types

Effect of Number of Input Sources

Value Format Transformations

Rule Based Data Repair

Schema Mapping Validation

Mapping Selection

Performance Evaluation

Effect of Number of Sources

Effect of Data Source Size

Effect of Wrangling Steps

RELATED WORK

Findings

CONCLUSIONS

Journal: IEEE Transactions on Big Data
Publication Date: May 9, 2019
Citations: 51
License type: CC BY 4.0
