Abstract
Background: Data scientists spend considerable amounts of time preparing data for analysis. Data preparation is labour-intensive because the data scientist typically takes fine-grained control over each aspect of each step in the process, motivating the development of techniques that seek to reduce this burden.
Results: This paper presents an architecture in which the data scientist need only describe the intended outcome of the data preparation process, leaving the software to determine how best to bring about that outcome. Key wrangling decisions on matching, mapping generation, mapping selection, format transformation and data repair are taken by the system, and the user need only provide: (i) the schema of the data target; (ii) partial representative instance data aligned with the target; (iii) criteria to be prioritised when populating the target; and (iv) feedback on candidate results. To support this, the proposed architecture dynamically orchestrates a collection of loosely coupled wrangling components, in which the orchestration is declaratively specified and includes self-tuning of component parameters.
Conclusion: This paper describes a data preparation architecture that has been designed to reduce the cost of data preparation through the provision of a central role for automation. An empirical evaluation with deep web and open government data investigates the quality and suitability of the wrangling result, the cost-effectiveness of the approach, the impact of self-tuning, and scalability with respect to the number of sources.
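As a rough illustration of the interaction model described above, the sketch below shows how the four user-supplied inputs might be expressed in code. The `WranglingTask` class, its fields and its method names are hypothetical stand-ins for the paper's architecture, not its actual interface; the property-data values are invented for illustration.

```python
# Illustrative sketch only: the WranglingTask API is hypothetical. It captures
# the four inputs the user provides in the proposed architecture, while the
# system itself takes the matching, mapping, transformation and repair decisions.

from dataclasses import dataclass, field


@dataclass
class WranglingTask:
    target_schema: list                                  # (i) schema of the data target
    examples: list = field(default_factory=list)         # (ii) partial instance data aligned with the target
    criteria: list = field(default_factory=list)         # (iii) criteria to prioritise when populating the target
    feedback: list = field(default_factory=list)         # (iv) feedback on candidate results

    def add_example(self, row):
        self.examples.append(row)

    def add_feedback(self, tuple_id, is_correct):
        self.feedback.append((tuple_id, is_correct))


# The data scientist describes only the intended outcome...
task = WranglingTask(
    target_schema=["postcode", "price", "bedrooms"],
    criteria=["completeness", "consistency"],
)
task.add_example({"postcode": "M1 1AA", "price": 250000, "bedrooms": 3})

# ...and later marks candidate result tuples as correct or incorrect, feedback
# the system could use when re-tuning its matching and mapping choices.
task.add_feedback(tuple_id=17, is_correct=False)
```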
Highlights
As a result of emerging technological advances that allow data to be captured, shared and stored at scale, the number and diversity of data sets available to organisations are growing rapidly.
Experiment setup: To evaluate the behaviour of the system, we used the datasets in Table 2, which comprise a mix of property data extracted from the web and curated open government data.
Although providing a size for the result could sometimes be difficult for data scientists, we note that working with an unconstrained result size in automated wrangling can lead to ever lower-quality tuples being included in the result; there needs to be some constraint on the size of the end data product.
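A minimal sketch of this point is given below: given candidate tuples with estimated quality scores, an unconstrained result keeps everything, whereas a size-constrained result keeps only the best k tuples. The scores, the ranking function and the example values are illustrative assumptions, not the paper's actual selection mechanism.

```python
# Hypothetical candidates: (tuple, estimated quality score). The scores are
# invented to show how low-quality tuples creep into an unconstrained result.
candidates = [
    ({"postcode": "M1 1AA", "price": 250000}, 0.95),
    ({"postcode": "M2 3BB", "price": 180000}, 0.80),
    ({"postcode": None,     "price": 120000}, 0.40),   # incomplete tuple
    ({"postcode": "??",     "price": -1},     0.10),   # malformed tuple
]

def select_result(candidates, target_size=None):
    """Rank candidates by estimated quality and optionally cap the result size."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    if target_size is not None:
        ranked = ranked[:target_size]
    return [row for row, _ in ranked]

print(select_result(candidates))                 # unconstrained: includes the low-quality tuples
print(select_result(candidates, target_size=2))  # constrained: only the best tuples are retained
```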
Summary
As a result of emerging technological advances that allow data to be captured, shared and stored at scale, the number and diversity of data sets available to organisations are growing rapidly. This is reflected, for example, in the adoption of data lakes, for which the market is predicted to grow at 28% per year from 2017 to 2023, to $14B. It is reported that data scientists typically spend 80% of their time on data preparation tasks. We show how Volume can be addressed by presenting experiments that involve significant numbers of sources, and how Velocity, in terms of rapidly changing sources, can be accommodated by automating the creation of data preparation tasks. Data preparation is labour-intensive because the data scientist typically takes fine-grained control over each aspect of each step in the process, motivating the development of techniques that seek to reduce this burden.