Several scientists have moved their IO- and CPU-intensive workflows to Data-Intensive Scalable Computing (DISC) frameworks to benefit from high scalability, broad support, and manufacturers’ infrastructure. A prominent framework is Apache Spark, which has grown rapidly over the last ten years and become one of the most widely used technologies in big data. Apache Spark offers several advantages, such as very efficient in-memory data management for large-scale applications through Resilient Distributed Datasets (RDDs). As an in-memory replacement for MapReduce, it enables the data handling activities of scientific workflows to be executed orders of magnitude faster than in other DISC environments. A major drawback, however, is that Apache Spark still lacks support for both data tracking and workflow provenance. Accordingly, the sole alternative for users who rely on provenance features is to spend countless hours collecting data from log files. As an additional challenge, Apache Spark treats legacy programs within workflows as “black-box” activities, which prevents the capture and analysis of data movements through RDDs. This manuscript presents SAMbA-RaP (Spark provenAnce MAnagement with Reports and Presentation), a solution for capturing, storing, and querying prospective and retrospective provenance, as well as domain data, within distributed scientific workflows. SAMbA-RaP's performance was evaluated on real workflow cases (SciPhy, Montage, WordCount, BuzzFlow, and SalesForecasts) from distinct domains, e.g., literature, bioinformatics, and astronomy, and the results indicate that the average overhead imposed by managing provenance data is acceptable. Moreover, the experiments indicate that our solution handles workflows with and without legacy applications alike, which enables users to query and verify provenance data in SAMbA-RaP reports straightforwardly and transparently.