Abstract
In this paper, we look at how we can leverage Spark platform for efficiently processing fine-grained provenance queries on large volumes of workflow provenance data. Simple recursive querying based Spark solutions involve large data scanning cost and hence do not work well. We propose a novel provenance framework which is engineered to quickly determine a small volume of data containing the entire lineage of the queried data-item. This small volume of data is then recursively processed to figure out the provenance of the queried data-item. We study the effectiveness of the proposed framework on a provenance trace obtained from a financial domain text curation workflow and report our observations. We show that the proposed framework easily outperforms the naive approaches.
Submitted Version (
Free)
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.