Crossing the finish line faster when paddling the data lake with KAYAK

Antonio Maccioni, Riccardo Torlone

doi:10.14778/3137765.3137792

Abstract

Paddling in a data lake is strenuous for a data scientist. Because a data lake is a loosely structured collection of raw data with little or no meta-information available, the difficulties of extracting insights from it start in the initial phases of data analysis. Indeed, data preparation, which involves many complex operations (such as source and feature selection, exploratory analysis, data profiling, and data curation), is a long and involved activity for navigating the lake before getting precious insights at the finish line. In this context, we demonstrate KAYAK, a framework that supports data preparation in a data lake with ad-hoc primitives and allows data scientists to cross the finish line sooner. KAYAK takes into account the user's tolerance for waiting on the primitives' results and uses incremental execution strategies to produce informative previews of these results. The framework relies on careful management of metadata and on features that limit human intervention, so it scales smoothly as the data lake evolves.
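The idea of trading accuracy for waiting time can be illustrated with a minimal sketch. This is not KAYAK's API or its actual execution strategy: the function name `incremental_preview`, its parameters, and the chunk-doubling schedule are all assumptions made for illustration. The sketch computes a running mean over progressively larger portions of a column and stops as soon as the user's tolerance (a time budget in seconds) is used up.

```python
import time
from typing import Callable, Iterator, Sequence, Tuple


def incremental_preview(
    values: Sequence[float],
    tolerance_s: float,
    first_chunk: int = 100,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[Tuple[int, float]]:
    """Yield (rows_seen, running_mean) previews until the tolerance expires.

    Hypothetical illustration: the chunk size doubles after every step, so
    early previews are cheap and later ones approach the exact answer.
    """
    deadline = clock() + tolerance_s
    total = 0.0
    seen = 0
    chunk = first_chunk
    while seen < len(values):
        end = min(seen + chunk, len(values))
        total += sum(values[seen:end])
        seen = end
        yield seen, total / seen
        if clock() >= deadline:
            break  # tolerance used up: the last preview is the answer given
        chunk *= 2


if __name__ == "__main__":
    column = list(range(1000))
    for rows, mean in incremental_preview(column, tolerance_s=0.5):
        print(f"preview after {rows} rows: mean = {mean}")
```

With a generous tolerance the loop runs to completion and the final preview is the exact mean; with a tolerance of zero the caller still receives one cheap preview computed on the first chunk.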

Journal: Proceedings of the VLDB Endowment
Publication Date: Aug 1, 2017
Citations: 16

Similar Papers

KAYAK: A Framework for Just-in-Time Data Preparation in a Data Lake
Antonio Maccioni ... Riccardo Torlone
01 Jan 2018

Analysis of AI based Data Wrangling Methods in Intelligent Knowledge Lakes
D Sasikala ... K Venkatesh Sharma
Journal of Soft Computing Paradigm | VOL. 4
30 Aug 2022

Modeling metadata in data lakes—A generic model
Rebecca Eichler ... Bernhard Mitschang
Data & Knowledge Engineering | VOL. 136
22 Sep 2021

Data Governance as Success Factor for Data Science
Paul Brous ... Rutger Krans
01 Jan 2020
