Provenance in Databases: Why, How, and Where

Abstract

Different notions of provenance for database queries have been proposed and studied in the past few years. In this article, we detail three main notions of database provenance and some of their applications, and compare and contrast them. Specifically, we review why-, how-, and where-provenance, describe the relationships among these notions, and describe some of their applications in confidence computation, view maintenance and update, debugging, and annotation propagation.
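To make the notions concrete, here is a minimal sketch (not from the article; relation and tuple names are hypothetical) of tracking why-provenance (which input tuples witness an output) and where-provenance (where each output value was copied from) through a single natural join:

```python
# Toy relations, each tuple tagged with a hypothetical id.
R = {"r1": {"emp": "alice", "dept": "sales"},
     "r2": {"emp": "bob",   "dept": "hr"}}
S = {"s1": {"dept": "sales", "floor": 3},
     "s2": {"dept": "hr",    "floor": 5}}

def join_with_provenance(R, S):
    """Natural join on 'dept', tracking provenance for each output tuple."""
    out = []
    for rid, r in R.items():
        for sid, s in S.items():
            if r["dept"] == s["dept"]:
                tup = {"emp": r["emp"], "dept": r["dept"], "floor": s["floor"]}
                out.append({
                    "tuple": tup,
                    # why-provenance: the witness set of input tuple ids
                    "why": {frozenset({rid, sid})},
                    # where-provenance: source location each value was copied from
                    "where": {"emp": (rid, "emp"),
                              "dept": (rid, "dept"),
                              "floor": (sid, "floor")},
                })
    return out

result = join_with_provenance(R, S)
```

Projections and unions would merge the witness sets of alternative derivations, which is where the two notions start to diverge.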

Similar Papers
  • Conference Article
  • Citations: 11
  • 10.1145/2835043.2835047
Data Provenance for Historical Queries in Relational Database
  • Oct 29, 2015
  • Asma Rani + 2 more

Capturing, modeling, and querying data provenance in databases has gained considerable importance in the last decade. All kinds of applications built on top of databases nowadays collect provenance for purposes such as establishing the trustworthiness of data, update management, and quality measurement. For these purposes, there is a need to efficiently capture, store, and query provenance information for current as well as historical queries executed on the database. Most existing provenance models, such as DBNotes, MONDRIAN, Perm, Orchestra, TRIO, and GProM, are suitable for capturing and querying provenance in relational databases. However, these models can capture provenance only for currently executing queries, except for TRIO and GProM, which can also capture and query provenance for historical queries, but at very high time and space cost. In this paper, we propose a framework, Data Provenance for Historical Queries (DPHQ), which is capable of efficiently capturing and querying provenance for queries, including historical queries. The proposed model also supports provenance for updates. In our model, we use a Zero Information Loss Database [2] to execute historical queries at any point in time, using the concept of nested relations. A graph database is used for storing and subsequently querying the provenance information.

  • Book Chapter
  • Citations: 8
  • 10.1007/978-3-030-31423-1_3
Provenance in Databases: Principles and Applications
  • Jan 1, 2019
  • Pierre Senellart

Data provenance is extra information computed during query evaluation over databases, which provides additional context about query results. Several formal frameworks for data provenance have been proposed, in particular based on provenance semirings. The provenance of a query can be computed in these frameworks for a variety of query languages. Provenance has applications in various settings, such as probabilistic databases, view maintenance, or explanation of query results. Though the theory of provenance semirings has mostly been developed in the setting of relational databases, it can also apply to other data representations, such as XML, graph, and triple-store databases.

  • Research Article
  • Citations: 1
  • 10.1109/tkde.2023.3265840
Summarizing Provenance of Aggregate Query Results in Relational Databases
  • Oct 1, 2023
  • IEEE Transactions on Knowledge and Data Engineering
  • Omar Alomeir + 3 more

Data provenance is any information about the origin of a piece of data and the process that led to its creation. Most database provenance work has focused on creating models and semantics to query and generate this provenance information. While comprehensive, provenance information remains large and overwhelming, making it hard for data provenance systems to support data exploration. We present a new approach to provenance exploration that builds on data summarization techniques. We contribute novel summarization schemes for the provenance of aggregation queries and techniques for the fast generation of these summarization schemes. We introduce two types of summaries for aggregate queries. Impact summaries take into account the impact of specific groups of tuples in the provenance of the query on an aggregate result, and comparative summaries allow users to compare the provenance of two aggregate results. We also present algorithms for efficient computation of these summaries, implement optimizations using data sampling and feature selection, and conduct experiments and a user survey to show the feasibility and relevance of our approaches.
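The core idea behind an impact-style summary can be sketched in a few lines (hypothetical data and naming, not the paper's algorithm): measure how much each group of provenance tuples contributes to an aggregate by removing the group and re-aggregating.

```python
# Tuples in the provenance of an aggregate query SUM(sales).
provenance = [
    {"region": "east", "sales": 100},
    {"region": "east", "sales": 50},
    {"region": "west", "sales": 30},
]

def impact(prov, key, value, agg=sum):
    """Per-group impact: the full aggregate minus the aggregate without the group."""
    total = agg(t[value] for t in prov)
    groups = {t[key] for t in prov}
    return {g: total - agg(t[value] for t in prov if t[key] != g)
            for g in groups}

print(impact(provenance, "region", "sales"))  # east contributes 150, west 30
```

For SUM this impact equals the group's own sum; for non-additive aggregates such as MAX or AVG, removal-based impact is what makes the summary informative, which is also why the paper needs sampling and feature selection to compute it efficiently.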

  • Conference Article
  • Citations: 13
  • 10.1145/1739041.1739043
Provenance for database transformations
  • Mar 22, 2010
  • Val Tannen

Database transformations (queries, views, mappings) take apart, filter, and recombine source data in order to populate warehouses, materialize views, and provide inputs to analysis tools. As they do so, applications often need to track the relationship between parts and pieces of the sources and parts and pieces of the transformations' output. This relationship is what we call database provenance.

This talk presents an approach to database provenance that is based on two observations. First, provenance is a kind of annotation, and we can develop a general approach to annotation propagation that also covers other applications, for example uncertainty and access control. In fact, provenance turns out to be the most general kind of such annotation, in a precise and practically useful sense. Second, the propagation of annotations through a broad class of transformations relies on just two operations: one when annotations are jointly used and one when they are used alternatively. This leads to annotations forming a specific algebraic structure, a commutative semiring.

The semiring approach works for annotating tuples, field values, and attributes in standard relations, in nested relations (complex values), and for annotating nodes in (unordered) XML. It works for transformations expressed in the positive fragment of relational algebra, nested relational calculus, unordered XQuery, as well as for Datalog, GLAV schema mappings, and tgd constraints. Specific semirings correspond to earlier approaches to provenance, while others correspond to forms of uncertainty, trust, cost, and access control.

This is joint work with J. N. Foster, T. J. Green, Z. Ives, and G. Karvounarakis, done in part within the frameworks of the Orchestra and pPOD projects.
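The two operations can be sketched concretely with how-provenance polynomials over tuple ids (a toy illustration with hypothetical names, not the talk's notation): joint use multiplies annotations, alternative use adds them.

```python
from collections import Counter

# A provenance polynomial is a bag of monomials; a monomial is a
# sorted tuple of source-tuple ids.

def oplus(a, b):
    """Alternative use (union/projection): add the polynomials."""
    return a + b

def otimes(a, b):
    """Joint use (join): multiply the polynomials, merging monomials."""
    out = Counter()
    for m1, c1 in a.items():
        for m2, c2 in b.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

# Annotate two source tuples with their own ids.
r1 = Counter({("r1",): 1})
s1 = Counter({("s1",): 1})

joined = otimes(r1, s1)        # joining r1 with s1 gives annotation r1·s1
both = oplus(joined, joined)   # two alternative derivations give 2·r1·s1
```

Specializing the semiring recovers earlier models: evaluating the polynomials with natural-number addition and multiplication yields bag multiplicities, while collapsing monomials to sets yields why-provenance.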

  • Book Chapter
  • Citations: 6
  • 10.1007/3-540-36108-1_3
The Lord of the Rings: Efficient Maintenance of Views at Data Warehouses
  • Jan 1, 2002
  • D Agrawal + 4 more

Data warehouses have become extremely important to support online analytical processing (OLAP) queries in databases. Since the data view that is obtained at a data warehouse is derived from multiple data sources that are continuously updated, keeping a data warehouse up-to-date becomes a crucial problem. An approach referred to as incremental view maintenance is widely used. Unfortunately, a precise and formal definition of view maintenance (which can actually be seen as a distributed computation problem) does not exist. This paper develops a formal model for maintaining views at data warehouses in a distributed asynchronous system. We start by formulating the view maintenance problem in terms of abstract update and data integration operations and state the notions of correctness associated with data warehouse views. We then present a basic protocol and establish its proof of correctness. Finally, we present an efficient version of the proposed protocol by incorporating several optimizations. This paper is thus mainly concerned with basic principles of distributed computing and their use in solving database-related problems.
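The core of incremental maintenance can be sketched with the textbook delta rule for insertions into a join view (an illustrative sketch, not the paper's protocol, with hypothetical data): the new view equals the old view plus (dR ⋈ S) ∪ (R ⋈ dS) ∪ (dR ⋈ dS), so the full join is never recomputed.

```python
def join(R, S):
    """Join sets of pairs on the first component: (k, a) ⋈ (k, b) -> (k, a, b)."""
    return {(k, a, b) for (k, a) in R for (k2, b) in S if k == k2}

def delta_join(R, S, dR, dS):
    """Tuples to add to the view R ⋈ S when dR and dS are inserted."""
    return join(dR, S) | join(R, dS) | join(dR, dS)

R  = {("sales", "alice")}
S  = {("sales", 3)}
dR = {("hr", "bob")}
dS = {("hr", 5)}

old_view = join(R, S)
new_view = old_view | delta_join(R, S, dR, dS)
assert new_view == join(R | dR, S | dS)   # matches full recomputation
```

The hard part the paper addresses is that in a distributed asynchronous warehouse the deltas from different sources arrive interleaved, so applying such rules naively can produce views that never correspond to any source state.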

  • Book Chapter
  • Citations: 3
  • 10.1007/3-540-63792-3_7
Maintaining constrained transitive closure by conjunctive queries
  • Jan 1, 1997
  • Guozhu Dong + 1 more

Recently there has been considerable effort on the maintenance of database views in general, and of deductive database views in particular. Such work considered the incremental maintenance of deductive views defined by expensive database queries, including the transitive closure query. An interesting approach is to use only first-order queries in this maintenance, after small changes (e.g. one tuple insertions and deletions) to base relations. In this paper we consider such incremental maintenance of views defined by constrained transitive closure queries, for the insertion case. The constraints may refer to node costs such as height, voltage and temperature at spatial locations of interest, and to edge costs such as distances between spatial locations. When the constraints are based on node costs, we divide the maintenance problem into several cases, depending on the nature of the constraints. For some cases, the constrained transitive closure becomes bounded in the sense that the recursion can be removed and thus the maintenance problem becomes trivial. For the other cases we provide complete solutions for maintaining the constrained transitive closure. The size of the auxiliary relations can be kept bounded by the size of the transitive closure of the graphs. We also illustrate how to maintain the constrained transitive closure in the presence of other kinds of constraints. However, whether such views can be maintained in first-order after the deletion of edges is a major open problem, even for the constraint-free transitive closure of directed graphs.
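The constraint-free special case the paper builds on has a classic first-order (non-recursive) maintenance step under a single edge insertion (a, b), sketched here in Python on a toy graph: TC'(x, y) holds iff TC(x, y) already held, or x reaches a (or equals a) and b reaches y (or equals y).

```python
def insert_edge(tc, nodes, a, b):
    """First-order maintenance of the transitive-closure relation tc
    after inserting the edge (a, b); no recursion needed."""
    return tc | {(x, y) for x in nodes for y in nodes
                 if (x == a or (x, a) in tc) and (y == b or (b, y) in tc)}

nodes = {1, 2, 3}
tc = {(1, 2)}                      # closure of the single edge 1 -> 2
tc = insert_edge(tc, nodes, 2, 3)  # adds (2, 3) and the derived (1, 3)
assert tc == {(1, 2), (2, 3), (1, 3)}
```

The paper's contribution is extending this style of maintenance to transitive closures constrained by node and edge costs, where the added constraints determine whether such a non-recursive update still exists.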

  • Book Chapter
  • Citations: 5
  • 10.1007/bfb0100997
Integration of incremental view maintenance into query optimizers
  • Jan 1, 1998
  • Dimitra Vista

We report on our experiences in integrating view maintenance policies into a database query optimizer. We present the design, implementation, and use of the RHODES query optimizer. RHODES is responsible for generating the maintenance expressions used to maintain views, as well as the execution plans that carry out this maintenance. We also discuss a variety of optimizations that RHODES applies during view maintenance and change propagation. We demonstrate the effectiveness of the proposed optimizations through experiments on the TPC-D database. The experiments also demonstrate the cost tradeoffs amongst multiple maintenance policies for a view.

  • Conference Article
  • Citations: 1
  • 10.1145/1754239.1754259
Analysis of declarative updates
  • Mar 22, 2010
  • Michael Benedikt

Declarative XML update languages are harder to analyze than queries. Static type inference and type checking are certainly more difficult, and even more basic effect analysis problems are complex -- what parts of a document does an update impact? I will begin by surveying the previous results on analysis of XML updates, and their relation to problems in XPath/XQuery. I will then focus on one interaction problem: do an update and a query interact? I will explain why this problem lies at the core of many optimization problems, particularly view maintenance under declarative updates and minimization of number of passes in update evaluation.

  • Research Article
  • Citations: 14
  • 10.1016/s0306-4379(99)00024-1
DLP: A description logic for extracting and managing complex terminological and structural properties from database schemes
  • Jul 1, 1999
  • Information Systems
  • Luigi Palopoli + 2 more


  • Conference Article
  • Citations: 2
  • 10.1109/icde51399.2021.00183
Summarizing Provenance of Aggregate Query Results in Relational Databases
  • Apr 1, 2021
  • Omar Alomeir + 3 more

Data provenance is any information about the origin of a piece of data and the process that led to its creation. Most database provenance work has focused on creating models and semantics to query and generate this information. While comprehensive, provenance information remains large and overwhelming, which can make it hard for provenance systems to support data exploration. We present a new approach to provenance exploration that builds on data summarization techniques. We contribute two novel summarization schemes for the provenance of aggregation queries: Impact summaries, and comparative summaries. We show with experiments that our techniques incur little overhead compared to basic summaries. We conduct a survey to show that our approaches are useful to users.

  • Research Article
  • 10.1142/s0219622022500845
Provenance Framework for Multi-Depth Querying Using Zero-Information Loss Database
  • Nov 30, 2022
  • International Journal of Information Technology & Decision Making
  • Asma Rani + 2 more

Data provenance is a kind of metadata that describes the origin and derivation history of data. It provides information about the various direct and indirect sources of data and the different transformations applied to it. Provenance information is beneficial in determining the quality, truthfulness, and authenticity of data. It also explains how, when, why, and by whom the data were created. In a relational database, fine-grained provenance captured at different stages (i.e., multi-layer provenance) is more significant and explanatory, as it provides various remarkable information such as the immediate and intermediate sources and origin of data. In this paper, we propose a novel multi-layer data provenance framework for Zero-Information Loss Relational Database (ZILRDB). The proposed framework is implemented on top of a relational database using object-relational database concepts to maintain all insert, delete, and update operations efficiently. It can capture multi-layer provenance for different query sets, including historical queries. We also propose Provenance Relational Algebra (PRA), an extension of traditional relational algebra, to capture provenance for ASPJU (Aggregate, Select, Project, Join, Union) queries in a relational database. The framework provides detailed provenance analysis through multi-depth provenance querying. We store the provenance data in both a relational and a graph database, and evaluate the performance of the framework in terms of provenance storage overhead and average execution time for provenance querying. We observe that the graph database offers significant performance gains over the relational database for executing multi-depth queries on provenance. We present two use-case studies to explain the usefulness of the proposed framework in various data-driven systems, increasing the understandability of a system's behavior and functionality.

  • Conference Article
  • Citations: 15
  • 10.1145/2643135.2643143
Database Queries that Explain their Work
  • Sep 8, 2014
  • James Cheney + 2 more

Provenance for database queries or scientific workflows is often motivated as providing explanation, increasing understanding of the underlying data sources and processes used to compute the query, and reproducibility, the capability to recompute the results on different inputs, possibly specialized to a part of the output. Many provenance systems claim to provide such capabilities; however, most lack formal definitions or guarantees of these properties, while others provide formal guarantees only for relatively limited classes of changes. Building on recent work on provenance traces and slicing for functional programming languages, we introduce a detailed tracing model of provenance for multiset-valued Nested Relational Calculus, define trace slicing algorithms that extract subtraces needed to explain or recompute specific parts of the output, and define query slicing and differencing techniques that support explanation. We state and prove correctness properties for these techniques and present a proof-of-concept implementation in Haskell.

  • Research Article
  • Citations: 3
  • 10.11591/telkomnika.v11i7.2827
Key Technologies and Applications of Secure Multiparty Computation
  • Jul 1, 2013
  • TELKOMNIKA Indonesian Journal of Electrical Engineering
  • Xiaoqiang Guo + 2 more

With the advent of the information age, network security is particularly important. Secure multiparty computation is an important branch of cryptography and a hotspot in the field of information security. It expands the scope of traditional distributed computing and information security, providing a new computing model for collaborative network computing. First, we introduce several key technologies of secure multiparty computation: secret sharing and verifiable secret sharing, homomorphic public-key cryptosystems, mix networks, zero-knowledge proofs, oblivious transfer, and the millionaires' problem. Second, we discuss the applications of secure multiparty computation in electronic voting, electronic auctions, threshold signatures, database queries, data mining, mechanical engineering, and other fields.
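One of the listed building blocks, secret sharing, has a particularly small instance: additive sharing modulo a prime. This is a minimal illustrative sketch (the modulus and parameters are arbitrary choices, not from the paper): the secret is split into n random-looking shares that sum back to the secret, and any n-1 of them reveal nothing about it.

```python
import random

P = 2**31 - 1   # a Mersenne prime, chosen here just for illustration

def share(secret, n):
    """Split secret into n additive shares modulo P."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    """Recover the secret by summing all shares modulo P."""
    return sum(shares) % P

shares = share(42, 5)
assert reconstruct(shares) == 42
```

Additive sharing is the basis for the homomorphic tricks the paper surveys: parties can add their shares of two secrets locally and thereby hold a sharing of the sum without ever seeing either secret.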

  • Research Article
  • Citations: 13
  • 10.1145/3451212
A Voltage-Controlled, Oscillation-Based ADC Design for Computation-in-Memory Architectures Using Emerging ReRAMs
  • Mar 25, 2022
  • ACM Journal on Emerging Technologies in Computing Systems
  • Mahta Mayahinia + 10 more

Conventional von Neumann architectures cannot successfully meet the demands of emerging computation and data-intensive applications. These shortcomings can be improved by embracing new architectural paradigms using emerging technologies. In particular, Computation-In-Memory (CiM) using emerging technologies such as Resistive Random Access Memory (ReRAM) is a promising approach to meet the computational demands of data-intensive applications such as neural networks and database queries. In CiM, computation is done in an analog manner; digitization of the results is costly in several aspects, such as area, energy, and performance, which hinders the potential of CiM. In this article, we propose an efficient Voltage-Controlled-Oscillator (VCO)–based analog-to-digital converter (ADC) design to improve the performance and energy efficiency of the CiM architecture. Due to its efficiency, the proposed ADC can be assigned in a per-column manner instead of sharing one ADC among multiple columns. This will boost the parallel execution and overall efficiency of the CiM crossbar array. The proposed ADC is evaluated using a Multiplication and Accumulation (MAC) operation implemented in ReRAM-based CiM crossbar arrays. Simulation results show that our proposed ADC can distinguish up to 32 levels within 10 ns while consuming less than 5.2 pJ of energy. In addition, our proposed ADC can tolerate ≈30% variability with a negligible impact on the performance of the ADC.

  • Research Article
  • Citations: 35
  • 10.1007/s10619-006-8490-2
A novel approach to resource scheduling for parallel query processing on computational grids
  • May 1, 2006
  • Distributed and Parallel Databases
  • Anastasios Gounaris + 3 more

Advances in network technologies and the emergence of Grid computing have both increased the need and provided the infrastructure for computation and data intensive applications to run over collections of heterogeneous and autonomous nodes. In the context of database query processing, existing parallelisation techniques cannot operate well in Grid environments because the way they select machines and allocate tasks compromises partitioned parallelism. The main contribution of this paper is the proposal of a low-complexity, practical resource selection and scheduling algorithm that enables queries to employ partitioned parallelism, in order to achieve better performance in a Grid setting. The evaluation results show that the scheduler proposed outperforms current techniques without sacrificing the efficiency of resource utilisation.
