Abstract
The process of preparing potentially large and complex data sets for further analysis or manual examination is often called data wrangling. In classical warehousing environments, the steps in such a process are carried out using Extract-Transform-Load platforms, with significant manual involvement in specifying, configuring or tuning many of them. In typical big data applications, we need to ensure that all wrangling steps, including web extraction, selection, integration and cleaning, benefit from automation wherever possible. Towards this goal, in this paper we: (i) introduce a notion of data context, which associates portions of a target schema with extensional data of types that are commonly available; (ii) define a scalable methodology to bootstrap an end-to-end data wrangling process based on data profiling; (iii) describe how data context is used to inform automation in several steps within wrangling, specifically matching, value format transformation, data repair, and mapping generation and selection, to optimise the accuracy, consistency and relevance of the result; and (iv) evaluate the approach with real estate data and financial data, showing substantial improvements in the results of automated wrangling.
Highlights
In the past decade, managing, processing and analysing data has changed radically towards establishing data-driven organisations.
We study the problem of cost-effectively automating an end-to-end data wrangling process, that is, integrating, cleaning and selecting from a large set of input sources to create a data product that is suitable for downstream analysis, optimising its quality.
We describe how automation can be informed by the data context, which consists of data sources D that can be aligned with the target schema, thereby providing partial, potentially erroneous and contradicting instance-based evidence about the target.
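To make the idea of instance-based evidence concrete, the following is a minimal sketch of how data context values might support matching. It is not the method used in the paper: the function names, the Jaccard overlap measure, the threshold and the sample data are all assumptions chosen for illustration. Each target attribute is associated with reference values from the data context, and a source column is matched to the target attribute whose reference values it overlaps most.

```python
from typing import Dict, List, Set, Tuple


def normalise(values: List[str]) -> Set[str]:
    """Lower-case and strip values so trivial differences do not hide overlap."""
    return {v.strip().lower() for v in values if v and v.strip()}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_with_data_context(
    source: Dict[str, List[str]],
    data_context: Dict[str, List[str]],
    threshold: float = 0.3,  # hypothetical cut-off, not taken from the paper
) -> List[Tuple[str, str, float]]:
    """Return (source column, target attribute, score) for the best match per column."""
    matches = []
    for column, values in source.items():
        src = normalise(values)
        scored = [
            (jaccard(src, normalise(ref)), target)
            for target, ref in data_context.items()
        ]
        if not scored:
            continue
        score, target = max(scored)
        if score >= threshold:
            matches.append((column, target, score))
    return matches


# Hypothetical source with uninformative column names.
source = {
    "col_a": ["Manchester", "Leeds", "York"],
    "col_b": ["M1 1AA", "LS1 4AP", "YO1 7HH"],
}
# Hypothetical data context: reference values aligned with target attributes.
data_context = {
    "city": ["manchester", "leeds", "oxford"],
    "postcode": ["M1 1AA", "YO1 7HH", "OX1 2JD"],
}
result = match_with_data_context(source, data_context)
print(result)
```

Because the data context may be partial or erroneous, a real system would treat such scores as one piece of evidence among several rather than as a final decision.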
Summary
In the past decade, managing, processing and analysing data has changed radically towards establishing data-driven organisations. We extend and refine these approaches to use target instances in automation, providing a comprehensive, end-to-end approach that incorporates instance-based evidence from multiple sources in the data context, evidence that may be partial or spurious. Using this notion, we define a domain-independent methodology to apply data context to a potentially large set of steps, and specific methods to inform a concrete set of individual steps, with the objective of improving the quality of the wrangling result. The paper also describes how data context can inform multiple steps within an end-to-end wrangling process (matching, mapping validation, value format transformation, rule-based data cleaning and mapping selection) to generate and validate candidates, with the objective of improving the accuracy, consistency and relevance of the wrangling result. The paper concludes with related work, conclusions and future work in Sections 5 and 6.
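As an illustration of how data context could help generate and validate candidates for value format transformation, the sketch below picks, from a small set of candidate transformations, the one that makes the most source values agree with reference values. This is a simplified stand-in under stated assumptions, not the paper's technique: the candidate set, the function name `select_transformation` and the sample postcodes are hypothetical.

```python
import re
from typing import Callable, Dict, List

# Hypothetical candidate transformations; a real system would derive these
# from the data rather than enumerate them by hand.
CANDIDATES: Dict[str, Callable[[str], str]] = {
    "identity": lambda v: v,
    "upper": lambda v: v.upper(),
    "upper_collapse_spaces": lambda v: re.sub(r"\s+", " ", v.strip()).upper(),
}


def select_transformation(
    source_values: List[str],
    reference_values: List[str],
    candidates: Dict[str, Callable[[str], str]] = CANDIDATES,
) -> str:
    """Return the name of the candidate whose output hits the most reference values."""
    reference = set(reference_values)
    best_name, best_hits = "identity", -1
    for name, transform in candidates.items():
        hits = sum(1 for v in source_values if transform(v) in reference)
        # Strict comparison keeps the earliest (simplest) candidate on ties.
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name


# Hypothetical source values in an inconsistent format.
source_values = ["m1  1aa", " ls1 4ap", "YO1 7HH"]
# Hypothetical data context values in the target format.
reference_values = ["M1 1AA", "LS1 4AP", "YO1 7HH"]

chosen = select_transformation(source_values, reference_values)
print(chosen)
```

Counting exact hits against the reference set is a deliberately crude validation signal; it only shows how extensional data can rank candidates without manual configuration.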