Abstract

The process of preparing potentially large and complex data sets for further analysis or manual examination is often called data wrangling. In classical warehousing environments, the steps in such a process are carried out using Extract-Transform-Load platforms, with significant manual involvement in specifying, configuring or tuning many of them. In typical big data applications, we need to ensure that all wrangling steps, including web extraction, selection, integration and cleaning, benefit from automation wherever possible. Towards this goal, in this paper we: (i) introduce a notion of data context, which associates portions of a target schema with extensional data of types that are commonly available; (ii) define a scalable methodology to bootstrap an end-to-end data wrangling process based on data profiling; (iii) describe how data context is used to inform automation in several steps within wrangling, specifically matching, value format transformation, data repair, and mapping generation and selection, to optimise the accuracy, consistency and relevance of the result; and (iv) evaluate the approach with real estate data and financial data, showing substantial improvements in the results of automated wrangling.

Highlights

  • In the past decade, managing, processing and analysing data has changed radically towards establishing data-driven organisations

  • We study the problem of cost-effectively automating an end-to-end data wrangling process, that is, integrating, cleaning and selecting from a large set of input sources to create a data product that is suitable for downstream analysis, while optimising its quality

  • We describe how automation can be informed by the data context, which consists of data sources D that can be aligned with the target schema, thereby providing partial, potentially erroneous and contradictory instance-based evidence about the target

Summary

INTRODUCTION

In the past decade, managing, processing and analysing data has changed radically towards establishing data-driven organisations. We extend and refine these approaches to use target instances in automation, to provide a comprehensive, end-to-end approach incorporating instance-based evidence from multiple sources in the data context that may be partial or spurious. Using this notion, we define a domain-independent methodology to apply data context to a potentially large set of steps, and specific methods to inform a concrete set of individual steps, with the objective of improving the quality of the wrangling result. The contributions include: 3) a description of how data context can inform multiple steps within an end-to-end wrangling process (matching, mapping validation, value format transformation, rule-based data cleaning and mapping selection) to generate and validate candidates, with the objective of improving the accuracy, consistency and relevance of the wrangling result. The paper concludes with related work, conclusions and future work in Sections 5 and 6.
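To make the idea of instance-based evidence more concrete, the following is a minimal sketch of how data context might inform schema matching. It is not the paper's algorithm: the `DataContext` class, the `overlap_score` and `match_columns` functions, the 0.5 threshold and the sample real estate values are all hypothetical, chosen only to illustrate aligning source columns with target attributes through known values.

```python
from dataclasses import dataclass, field


@dataclass
class DataContext:
    """Hypothetical container: target attribute name -> known example values."""
    values: dict[str, set[str]] = field(default_factory=dict)


def normalise(value: str) -> str:
    # Trivial normalisation so that case and padding do not hide a match.
    return value.strip().lower()


def overlap_score(column: list[str], reference: set[str]) -> float:
    """Fraction of distinct, non-empty column values found in the reference set."""
    distinct = {normalise(v) for v in column if v and v.strip()}
    if not distinct:
        return 0.0
    ref = {normalise(v) for v in reference}
    return len(distinct & ref) / len(distinct)


def match_columns(
    source: dict[str, list[str]],
    context: DataContext,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> dict[str, tuple[str, float]]:
    """Map each source column to the best-overlapping target attribute, if any."""
    matches: dict[str, tuple[str, float]] = {}
    for col, vals in source.items():
        best_attr, best = None, 0.0
        for attr, ref in context.values.items():
            score = overlap_score(vals, ref)
            if score > best:
                best_attr, best = attr, score
        if best_attr is not None and best >= threshold:
            matches[col] = (best_attr, best)
    return matches


# Illustrative data only: the context is partial, and the source has a value it does not cover.
context = DataContext({
    "postcode": {"M1 1AA", "M13 9PL", "OX1 2JD"},
    "city": {"Manchester", "Oxford"},
})
source = {
    "pc": ["m1 1aa", "M13 9PL ", "ZZ9 9ZZ"],
    "town": ["Oxford", "manchester"],
    "price": ["250000", "310000"],
}
matches = match_columns(source, context)
```

In this sketch the `price` column stays unmatched because the context holds no evidence for it, which reflects the point that data context may be partial. A threshold below 1.0 is one simple way to tolerate erroneous or missing reference values.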

PROBLEM STATEMENT
Data Context Types
DATA CONTEXT INFORMED WRANGLING
Source and Data Context Profiling
Schema Matching
Automating Schema Matching
Context Informed Automation
Value Format Transformation
Automating Value Format Transformations
Rule-Based Data Repair
Automating Rule Based Data Repair
Automating Mapping Generation
Multi-Criteria Mapping Selection
Automating Mapping Selection
EXPERIMENTAL EVALUATION
Application Domain and Data
Measuring Wrangling Quality
Data Context
Effect of Data Context on the Wrangling Result
Effect of Multiple Data Context Types
Effect of Different Data Context Types
Effect of Number of Input Sources
Value Format Transformations
Rule Based Data Repair
Schema Mapping Validation
Mapping Selection
Performance Evaluation
Effect of Number of Sources
Effect of Data Source Size
Effect of Wrangling Steps
RELATED WORK
Findings
CONCLUSIONS