Abstract
Background: Data scientists spend considerable amounts of time preparing data for analysis. Data preparation is labour-intensive because the data scientist typically takes fine-grained control over each aspect of each step in the process, motivating the development of techniques that seek to reduce this burden.
Results: This paper presents an architecture in which the data scientist need only describe the intended outcome of the data preparation process, leaving the software to determine how best to bring about that outcome. Key wrangling decisions on matching, mapping generation, mapping selection, format transformation and data repair are taken by the system, and the user need only provide: (i) the schema of the data target; (ii) partial representative instance data aligned with the target; (iii) criteria to be prioritised when populating the target; and (iv) feedback on candidate results. To support this, the proposed architecture dynamically orchestrates a collection of loosely coupled wrangling components, in which the orchestration is declaratively specified and includes self-tuning of component parameters.
Conclusion: This paper describes a data preparation architecture that has been designed to reduce the cost of data preparation through the provision of a central role for automation. An empirical evaluation with deep web and open government data investigates the quality and suitability of the wrangling result, the cost-effectiveness of the approach, the impact of self-tuning, and scalability with respect to the number of sources.
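As a rough illustration of the interaction model described above, the sketch below shows how the four user-supplied inputs might be expressed in code. The `WranglingTask` class, its fields and its method names are hypothetical stand-ins for the paper's architecture, not its actual interface; the property-data values are invented for illustration.

```python
# Illustrative sketch only: the WranglingTask API is hypothetical. It captures
# the four inputs the user provides in the proposed architecture, while the
# system itself takes the matching, mapping, transformation and repair decisions.

from dataclasses import dataclass, field


@dataclass
class WranglingTask:
    target_schema: list                                  # (i) schema of the data target
    examples: list = field(default_factory=list)         # (ii) partial instance data aligned with the target
    criteria: list = field(default_factory=list)         # (iii) criteria to prioritise when populating the target
    feedback: list = field(default_factory=list)         # (iv) feedback on candidate results

    def add_example(self, row):
        self.examples.append(row)

    def add_feedback(self, tuple_id, is_correct):
        self.feedback.append((tuple_id, is_correct))


# The data scientist describes only the intended outcome...
task = WranglingTask(
    target_schema=["postcode", "price", "bedrooms"],
    criteria=["completeness", "consistency"],
)
task.add_example({"postcode": "M1 1AA", "price": 250000, "bedrooms": 3})

# ...and later marks candidate result tuples as correct or incorrect, feedback
# the system could use when re-tuning its matching and mapping choices.
task.add_feedback(tuple_id=17, is_correct=False)
```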
Highlights
As a result of emerging technological advances that allow data to be captured, shared and stored at scale, the number and diversity of data sets available to organisations are growing rapidly.
Experiment setup: To evaluate the behaviour of the system, we used the datasets in Table 2, which comprise a mix of property data extracted from the web and curated open government data.
Although providing a size for the result could sometimes be difficult for data scientists, we note that working with an unconstrained result size in automated wrangling can lead to ever lower-quality tuples being included in the result; there needs to be some constraint on the size of the end data product.
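A minimal sketch of this point is given below: given candidate tuples with estimated quality scores, an unconstrained result keeps everything, whereas a size-constrained result keeps only the best k tuples. The scores, the ranking function and the example values are illustrative assumptions, not the paper's actual selection mechanism.

```python
# Hypothetical candidates: (tuple, estimated quality score). The scores are
# invented to show how low-quality tuples creep into an unconstrained result.
candidates = [
    ({"postcode": "M1 1AA", "price": 250000}, 0.95),
    ({"postcode": "M2 3BB", "price": 180000}, 0.80),
    ({"postcode": None,     "price": 120000}, 0.40),   # incomplete tuple
    ({"postcode": "??",     "price": -1},     0.10),   # malformed tuple
]

def select_result(candidates, target_size=None):
    """Rank candidates by estimated quality and optionally cap the result size."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    if target_size is not None:
        ranked = ranked[:target_size]
    return [row for row, _ in ranked]

print(select_result(candidates))                 # unconstrained: includes the low-quality tuples
print(select_result(candidates, target_size=2))  # constrained: only the best tuples are retained
```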
Summary
As a result of emerging technological advances that allow data to be captured, shared and stored at scale, the number and diversity of data sets available to organisations are growing rapidly. This is reflected, for example, in the adoption of data lakes, for which the market is predicted to grow at 28% per year from 2017 to 2023, to $14B. It is reported that data scientists typically spend 80% of their time on data preparation tasks. We show how Volume can be addressed by presenting experiments that involve significant numbers of sources, and how Velocity, in terms of rapidly changing sources, can be accommodated by automating the creation of data preparation tasks. Data preparation is labour-intensive because the data scientist typically takes fine-grained control over each aspect of each step in the process, motivating the development of techniques that seek to reduce this burden.