Abstract

Provenance metadata captures the derivation history of an entity, such as a dataset obtained through numerous data transformations. It is of great importance for science, among other fields, because it enables reproducibility and makes research results easier to interpret. With the avalanche of provenance produced by today’s society, there is a pressing need to store large provenance graphs and query them with low latency. To address this need, we present a scalable approach to storing and querying provenance graphs using DataStax Enterprise (DSE), a popular NoSQL column-family database system. Specifically, we (i) propose a storage scheme, including two novel indices that enable efficient traversal of provenance graphs along causality lines; (ii) present an algorithm for building the proposed indices for a given provenance graph; and (iii) implement the algorithm and conduct a performance study in which we store and query a provenance graph with over five million vertices on a DSE cluster running in the AWS cloud. The results of the performance study validate the scalability and performance efficiency of our approach.
