Discovering Similar Workflows via Provenance Clustering: A Case Study

Abdussalam Alawini, Susan Davidson, Junhyong Kim, Leshang Chen, Stephen Fisher

doi:10.1007/978-3-319-98379-0_9

Abstract

Several workflow management systems and scripting languages have adopted provenance tracking, yet many researchers choose to capture provenance manually or to instrument their processing scripts to write provenance information to files. The Next Generation Sequencing (NGS) project we are associated with tracks provenance in this manner. The NGS project is a collaboration between multiple groups at different sites, where each group collects and processes samples using an agreed-upon workflow. The workflow contains many stages of varying complexity. Over time, workflow stages are modified, but data samples are only comparable when processed with identical versions of the workflow. However, for various reasons (including the distributed nature of the collaboration), it is not always clear which samples have been processed with which version of the workflow. In this paper, we introduce new techniques for clustering provenance datasets to discover those that were likely generated by the same workflow. Based on the clustering result, users can identify similar provenance records and categorize them into clusters for debugging and zoom-in/zoom-out viewing.
