Fine-grained Provenance Research Articles

Streaming applications continuously process data to deliver streams of up-to-date results. Their growing adoption for data analysis in many distributed systems is motivated by their performance (in terms of processing throughput and latency) and their support for easy-to-program distributed and parallel analysis. When streaming applications are designed to detect unusual or critical events (e.g., security- or safety-related), it can be beneficial to maintain the associated source data for further analysis. This can be achieved by fine-grained data provenance, which links each detected event back to the source data that contributed to it, making it possible to distinguish and isolate the source data that generated such unusual or critical events. Fine-grained data provenance can be especially useful in cyber-physical systems, such as vehicular networks and smart grids. By enabling the extraction of valuable information from raw sensor data, it could, for instance, reduce data transmission and storage requirements. Since cyber-physical systems can have heterogeneous multi-core architectures, ranging from inexpensive single-board computers to high-end servers, there is a demand for efficient provenance techniques that can take advantage of such parallel architectures with minimal overhead. Motivated by this challenge, we present GeneaLog, a novel fine-grained data provenance technique for data streaming applications. Leveraging the logical dependencies of the data, GeneaLog takes advantage of cross-layer properties of the software stack and incurs a minimal, constant-size per-tuple overhead. Furthermore, it allows for a modular and efficient algorithmic implementation using only standard (instrumented) data streaming operators. This is particularly useful for distributing the provenance overhead across operators that can run in parallel, thus leveraging multi-core architectures.
We evaluate two implementations of GeneaLog, one based on Apache Flink, a widely adopted state-of-the-art Stream Processing Engine, and one based on Liebre, an edge-tailored lightweight Stream Processing Engine. We test both on vehicular and smart grid applications with single-board embedded devices and a high-end server, also studying how GeneaLog affects their scalability and confirming that it efficiently captures fine-grained provenance data with minimal overhead.
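
The idea of linking each detected event back to its contributing source data, with a fixed amount of metadata per tuple, can be illustrated with a minimal Python sketch. It is not GeneaLog's implementation: the names `StreamTuple`, `map_op`, `tumbling_sum` and `sources`, the field layout (`kind`, `u1`, `u2`, `nxt`), and the count-based tumbling window are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass(eq=False)
class StreamTuple:
    """A tuple carrying a fixed number of provenance fields (hypothetical layout)."""
    value: float
    kind: str = "SOURCE"                  # "SOURCE", "MAP" or "AGGREGATE"
    u1: Optional["StreamTuple"] = None    # MAP: the input; AGGREGATE: first window tuple
    u2: Optional["StreamTuple"] = None    # AGGREGATE: last window tuple
    nxt: Optional["StreamTuple"] = None   # next tuple within the same window


def map_op(stream: Iterable[StreamTuple],
           fn: Callable[[float], float]) -> Iterator[StreamTuple]:
    """Instrumented map: each output points back to the single input it came from."""
    for t in stream:
        yield StreamTuple(fn(t.value), kind="MAP", u1=t)


def tumbling_sum(stream: Iterable[StreamTuple], size: int) -> Iterator[StreamTuple]:
    """Instrumented sum over count-based tumbling windows.

    Inputs of a window are chained through `nxt`; the output stores only the
    first and last input, so its own metadata stays constant in size.
    A trailing incomplete window is dropped.
    """
    first = last = None
    total, count = 0.0, 0
    for t in stream:
        if first is None:
            first = t
        else:
            last.nxt = t
        last = t
        total += t.value
        count += 1
        if count == size:
            yield StreamTuple(total, kind="AGGREGATE", u1=first, u2=last)
            first = last = None
            total, count = 0.0, 0


def sources(t: StreamTuple) -> Iterator[StreamTuple]:
    """Follow the provenance pointers from a result back to its source tuples."""
    if t.kind == "SOURCE":
        yield t
    elif t.kind == "MAP":
        yield from sources(t.u1)
    else:  # AGGREGATE: walk the chained window from u1 to u2
        cur = t.u1
        while cur is not None:
            yield from sources(cur)
            if cur is t.u2:
                break
            cur = cur.nxt


# Toy pipeline: double each reading, sum every 3 readings, alert above 20.
readings = [StreamTuple(v) for v in (1.0, 2.0, 3.0, 4.0, 5.0, 6.0)]
doubled = map_op(readings, lambda v: v * 2)
sums = list(tumbling_sum(doubled, size=3))
alerts = [s for s in sums if s.value > 20.0]
contributing = [src.value for a in alerts for src in sources(a)]
print(contributing)  # [4.0, 5.0, 6.0]
```

In this sketch only the second window raises an alert, and traversing its pointers isolates the three readings that produced it; the readings behind the non-alerting window are never visited and could be discarded.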

Fast, massive, and viral data diffused on social media affects a large share of the online population, and thus the (prospective) information diffusion mechanisms behind it are of great interest to researchers. The (retrospective) provenance of such data is equally important because it contributes to the understanding of the relevance and trustworthiness of the information. Furthermore, computing provenance in a timely way is crucial for particular use cases and practitioners, such as online journalists who need to assess specific pieces of information promptly. Social media currently provide insufficient mechanisms for provenance tracking, publication and generation, while state-of-the-art social media research focuses mainly on explicit diffusion mechanisms (like retweets on Twitter or reshares on Facebook). The implicit diffusion mechanisms remain understudied because they are difficult to capture and properly understand. On the technical side, the state of the art for provenance reconstruction evaluates small datasets after the fact, sidestepping the scale and speed requirements of current social media data. In this paper, we investigate the mechanisms of implicit information diffusion by computing its fine-grained provenance. We prove that explicit mechanisms are insufficient to capture influence, and our analysis unravels a significant part of implicit interactions and influence in social media. Our approach works incrementally and can be scaled up to cover a truly Web-scale scenario like major events. We can process datasets consisting of up to several million messages on a single machine at rates that cover bursty behaviour, without compromising result quality. In doing so, we provide online journalists, and social media users in general, with fine-grained provenance reconstruction that sheds light on implicit interactions not captured by social media providers.
These results are provided in an online fashion, which also allows for fast relevance and trustworthiness assessment.

Articles published on Fine-grained Provenance

Facilitating the Sharing of Electrophysiology Data Analysis Results Through In-Depth Provenance Capture

Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance

Provenance Framework for Multi-Depth Querying Using Zero-Information Loss Database

DPDS

Compact, tamper-resistant archival of fine-grained provenance

Capturing and querying fine-grained provenance of preprocessing pipelines in data science

Ceramic exchange and the shifting political landscape in the Valley of Oaxaca, Mexico, 700 BCE-200 CE

GeneaLog: Fine-grained data streaming provenance in cyber-physical systems

CF-PROV: A Content-Rich and Fine-Grained Scientific Workflow Provenance Model

Web-scale provenance reconstruction of implicit information diffusion on social media

Model provenance tracking and inference for integrated environmental modelling

A Distributed System for The Management of Fine-grained Provenance

Efficient Stream Provenance via Operator Instrumentation

An Inference-Based Framework to Manage Data Provenance in Geoscience Applications
