Abstract

Scientific progress increasingly depends on data management, particularly to clean and curate data so that it can be systematically analyzed and reused. A wealth of techniques for managing and curating data (and its provenance) have been proposed, largely in the database community. In particular, a number of influential papers have proposed collecting provenance information explaining where a piece of data was copied from, or what other records were used to derive it. Most of these techniques, however, exist only as research prototypes and are not available in mainstream database systems. This means scientists must either implement such techniques themselves or (all too often) go without. This is essentially a code reuse problem: provenance techniques currently cannot be implemented reusably, only as ad hoc, usually unmaintained extensions to standard databases. An alternative, relatively unexplored approach is to support such techniques at a higher abstraction level, using metaprogramming or reflection techniques. Can advanced programming techniques make it easier to transfer provenance research results into practice? We build on a recent approach called language-integrated provenance, which extends language-integrated query techniques with source-to-source query translations that record provenance. In previous work, a proof of concept was developed in a research programming language called Links, which supports sophisticated Web and database programming. In this paper, we show how to adapt this approach to work in Haskell, building on top of the Database-Supported Haskell (DSH) library. Even though it seemed clear in principle that Haskell's rich programming features ought to be sufficient, implementing language-integrated provenance in Haskell required overcoming a number of technical challenges due to interactions between these capabilities.
Our implementation serves as a proof of concept showing how this combination of metaprogramming features can, for the first time, make data provenance facilities available to programmers as a library in a widely used, general-purpose language. In our work we were successful in implementing forms of provenance known as where-provenance and lineage. We have tested our implementation using a simple database and query set and established that the resulting queries are executed correctly on the database. Our implementation is publicly available on GitHub. Our work makes provenance tracking available to users of DSH at little cost. Although Haskell is not widely used for scientific database development, our work suggests which language features are necessary to support provenance as a library. We also highlight how combining Haskell's advanced type programming features can lead to unexpected complications, which may motivate further research into type system expressiveness.

Highlights

  • Provenance is information about the origin, derivation or history of an object

  • Our work demonstrates that provenance tracking does not have to be built into a language or database implementation, but can be provided as a library instead

  • This is an important step towards supporting provenance tracking for scientific database systems written in mainstream programming languages

Summary

Introduction

Provenance is information (metadata) about the origin, derivation or history of an object. Most provenance systems have been implemented as ad hoc extensions to mainstream database systems. Systems such as Perm [ ] showed that it is possible to support provenance by providing a middleware layer on top of the database and translating queries to record their own provenance. Previous work on language-integrated provenance showed that it is possible to implement the query transformations needed for where-provenance and lineage tracking at a high level in Links, before compiling queries to SQL. That approach required making nontrivial modifications to the Links interpreter. Our implementation is publicly available on GitHub: https://github.com/jstolarek/skye-dsh

[Figure: Agencies and tours database]
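To make the two forms of provenance concrete, the following is a minimal sketch in plain Haskell over in-memory lists. It does not use the DSH API or the Skye-DSH library: every type, function name, and data row below is invented for illustration, loosely modelled on an agencies and tours database. It only shows what a query result annotated with where-provenance or lineage might look like.

```haskell
module Main where

-- All names and rows below are hypothetical; none come from DSH or Skye-DSH.

-- | Where-provenance: the table, column and row a value was copied from.
data WhereProv = WhereProv
  { provTable  :: String
  , provColumn :: String
  , provRow    :: Int
  } deriving (Eq, Show)

-- | A value paired with its where-provenance.
data Prov a = Prov { value :: a, whereProv :: WhereProv }
  deriving (Eq, Show)

-- | A result paired with its lineage: the source rows used to derive it.
data Lineage a = Lineage { result :: a, sources :: [(String, Int)] }
  deriving (Eq, Show)

data Agency = Agency { aId :: Int, aName :: String, aPhone :: String }
data Tour   = Tour   { tId :: Int, tAgency :: String, tPrice :: Int }

-- Made-up sample rows.
agencies :: [Agency]
agencies = [Agency 1 "AgencyA" "000-0001", Agency 2 "AgencyB" "000-0002"]

tours :: [Tour]
tours = [Tour 1 "AgencyA" 20, Tour 2 "AgencyA" 50, Tour 3 "AgencyB" 60]

-- | Plain query: phone numbers of agencies offering a tour below a price limit.
cheapPhones :: Int -> [String]
cheapPhones limit =
  [ aPhone a | a <- agencies, t <- tours, aName a == tAgency t, tPrice t < limit ]

-- | Same query, but each phone number records the cell it was copied from.
cheapPhonesWhere :: Int -> [Prov String]
cheapPhonesWhere limit =
  [ Prov (aPhone a) (WhereProv "agencies" "phone" (aId a))
  | a <- agencies, t <- tours, aName a == tAgency t, tPrice t < limit ]

-- | Same query, but each result records every source row that contributed.
cheapPhonesLineage :: Int -> [Lineage String]
cheapPhonesLineage limit =
  [ Lineage (aPhone a) [("agencies", aId a), ("tours", tId t)]
  | a <- agencies, t <- tours, aName a == tAgency t, tPrice t < limit ]

main :: IO ()
main = do
  mapM_ print (cheapPhonesWhere 55)
  mapM_ print (cheapPhonesLineage 55)
```

In this sketch the annotated variants are written by hand next to the plain query. The point of language-integrated provenance is that a library derives such annotated queries from the plain one through query translation, so the programmer does not have to maintain them separately.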

Background
Design of provenance tracking in DSH
Conclusions
Standard Prelude
Algebraic Data Types
Type Classes
List comprehensions
B Examples of DSH queries with provenance tracking
Lineage tracking
Example
D Lineage implementation details