Abstract

Provenance network analytics is a novel data analytics approach that helps infer properties of data, such as quality or importance, from their provenance. Instead of analysing application data, which are typically domain-dependent, it analyses the data’s provenance as represented using the World Wide Web Consortium’s domain-agnostic PROV data model. Specifically, the approach proposes a number of network metrics for provenance data and applies established machine learning techniques over such metrics to build predictive models for some key properties of data. Applying this method to the provenance of real-world data from three different applications, we show that it can successfully identify the owners of provenance documents, assess the quality of crowdsourced data, and identify instructions from chat messages in an alternate-reality game with high levels of accuracy. By so doing, we demonstrate the different ways the proposed provenance network metrics can be used in analysing data, providing the foundation for provenance-based data analytics.
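The first step described above, turning a provenance graph into network metrics, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy graph, the function names, and the choice of metrics (node and edge counts, counts per PROV node type, and the diameter of the undirected graph) are all hypothetical, and the paper's actual metric set may differ.

```python
from collections import Counter, deque

# Hypothetical toy provenance graph: node identifier -> PROV node type.
NODE_TYPES = {
    "report": "entity",
    "draft": "entity",
    "editing": "activity",
    "alice": "agent",
}

# Directed edges as (source, target, relation); relation names follow PROV.
EDGES = [
    ("report", "editing", "wasGeneratedBy"),
    ("editing", "draft", "used"),
    ("editing", "alice", "wasAssociatedWith"),
    ("report", "draft", "wasDerivedFrom"),
]


def undirected_neighbours(edges):
    """Build an undirected adjacency map, ignoring edge direction."""
    adj = {}
    for source, target, _relation in edges:
        adj.setdefault(source, set()).add(target)
        adj.setdefault(target, set()).add(source)
    return adj


def eccentricity(adj, start):
    """Longest shortest-path distance from start to any reachable node (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return max(dist.values())


def network_metrics(node_types, edges):
    """Return an illustrative feature dictionary for one provenance graph."""
    adj = undirected_neighbours(edges)
    type_counts = Counter(node_types.values())
    return {
        "nodes": len(node_types),
        "edges": len(edges),
        "entities": type_counts["entity"],
        "activities": type_counts["activity"],
        "agents": type_counts["agent"],
        # Maximum finite distance between any two nodes, direction ignored.
        "diameter": max(eccentricity(adj, n) for n in node_types),
    }


metrics = network_metrics(NODE_TYPES, EDGES)
print(metrics)
```

One such feature dictionary per provenance graph, paired with a known label (for example, a trust category), could then serve as a training row for an off-the-shelf classifier.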

Highlights

  • Provenance, a description of what influenced the generation of a piece of information or data, has become an important topic in several communities since it exposes how information flows in systems, providing the means to make them accountable and helping users decide whether information is to be trusted (Moreau 2010).

  • These results suggest that provenance type information, as captured by the provenance-specific network metrics, helps identify the originator of a provenance graph, and that ignoring such information results in lower performance.

  • The strong correlation between the provenance network metrics and data quality in CollabMap discovered by the classifiers suggests that analysing network metrics of provenance graphs is a promising approach to making sense of the activities and data they describe, such as classifying crowd-generated data into trust categories, as in this case.


Summary

Introduction

Provenance, a description of what influenced the generation of a piece of information or data, has become an important topic in several communities since it exposes how information flows in systems, providing the means to make them accountable and helping users decide whether information is to be trusted (Moreau 2010). As a provenance description ‘links’ artefacts with their influences, it can be represented as a graph, called a provenance graph, whose nodes represent the artefacts and influences and whose edges represent their relations with one another. Studying such graphs, e.g. by visualising them, can facilitate understanding of the provenance information they contain. […] (Wolstencroft et al 2013; Silva et al 2011; Gil et al 2011; Bowers et al 2008), being applied to peta-scale problems, are generating vast amounts of provenance information. Such large and complex graphs are overwhelming for manual interpretation or verification (of data correctness, for instance). An automated and principled way to analyse provenance data at such scales and, more importantly, to understand what they convey with respect to the data they describe, is much needed.

