Capturing and Querying Structural Provenance in Spark with Pebble

Ralf Diestelkämper, Melanie Herschel

doi:10.1145/3299869.3320225

Abstract

Analyzing and debugging Spark processing pipelines is a tedious task that typically involves substantial engineering effort. The task becomes even more complex when the pipelines process nested data. Provenance solutions that track the derivation process of individual data items assist data engineers in debugging these pipelines. However, state-of-the-art solutions do not precisely track nested data items. We demonstrate Pebble, a system for capturing and querying structural provenance, a new type of provenance on nested data in Spark. It captures access to and modification of top-level as well as nested data items, and allows querying the provenance of nested items based on tree-pattern matching. Implemented as a standalone library on top of Apache Spark, it seamlessly leverages the underlying infrastructure for scalability. Through a graphical user interface implemented in a Jupyter notebook, we showcase ten debugging scenarios of Spark programs on real-world datasets.
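The idea of selecting nested items with a tree pattern can be illustrated with a minimal sketch. It is not Pebble's API or implementation: the `match` function, the `ANY` wildcard, and the sample record are hypothetical names chosen for illustration, and the sketch runs on plain Python dicts and lists rather than on Spark data. A pattern mirrors the shape of the data, and a successful match returns the paths of the nested values it selects.

```python
from typing import Any, List, Optional, Tuple

Path = Tuple[Any, ...]

# Hypothetical wildcard leaf: matches any value at its position.
ANY = object()


def match(item: Any, pattern: Any, path: Path = ()) -> Optional[List[Path]]:
    """Return the paths of the nested values selected by `pattern`,
    or None if `item` does not match the pattern."""
    # Nested collection: the pattern must match at least one element.
    if isinstance(item, list):
        hits: List[Path] = []
        for index, element in enumerate(item):
            sub = match(element, pattern, path + (index,))
            if sub is not None:
                hits.extend(sub)
        return hits if hits else None

    # Inner pattern node: every named attribute must be present and match.
    if isinstance(pattern, dict):
        if not isinstance(item, dict):
            return None
        hits = []
        for key, sub_pattern in pattern.items():
            if key not in item:
                return None
            sub = match(item[key], sub_pattern, path + (key,))
            if sub is None:
                return None
            hits.extend(sub)
        return hits

    # Leaf pattern: wildcard or constant comparison.
    if pattern is ANY or item == pattern:
        return [path]
    return None


# Hypothetical nested record, for illustration only.
tweet = {
    "user": {"name": "alice"},
    "hashtags": [{"text": "spark"}, {"text": "provenance"}],
}

# Select the user name and every hashtag whose text is "spark".
pattern = {"user": {"name": ANY}, "hashtags": {"text": "spark"}}
paths = match(tweet, pattern)
print(paths)  # [('user', 'name'), ('hashtags', 0, 'text')]
```

Returning paths rather than values keeps the position of each matched item inside the nested structure, which is the kind of information a provenance query on nested data needs to refer to.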

Similar Papers

A Provenance-Aware Access Control Framework with Typed Provenance
Lianshan Sun ... Dang Nguyen
IEEE Transactions on Dependable and Secure Computing | VOL. 13
01 Jul 2016

A survey on provenance: What for? What form? What from?
Melanie Herschel ... Ralf Diestelkämper
The VLDB Journal | VOL. 26
16 Oct 2017

In-Memory Indexed Caching for Distributed Data Processing
Alexandru Uta ... Jan Rellermeyer
01 May 2022

Privacy-enhanced attribute-based private information retrieval
Jianchang Lai ... Willy Susilo
Information Sciences | VOL. 454-455
01 May 2018
