Unmanaged Workflows: Their Provenance and Use
Provenance of scientific data will play an increasingly critical role as scientists are encouraged by funding agencies and grand challenge problems to share and preserve scientific data. But it is foolhardy to believe that all human processes, particularly as varied as the scientific discovery process, will be fully automated by a workflow system. Consequently, provenance capture has to be thought of as a problem applied to both human and automated processes. The unmanaged workflow is the full human-driven activity, encompassing tasks whose execution is automated by an orchestration tool, and tasks that are done outside an orchestration tool. In this chapter we discuss the implications of the unmanaged workflow as it affects provenance capture, representation, and use. Illustrations of capture include multiple experiences with unmanaged capture using the Karma tool. Illustrations of use include defining workflows by suggesting additions to workflow designs under construction, reconstructing process traces, and using analysis tools to assess provenance quality.
Keywords: Data provenance; e-Science workflows; provenance capture; data mining; case-based reasoning; intelligent user interfaces
- Research Article
15
- 10.1016/j.socscimed.2023.116194
- Aug 29, 2023
- Social science & medicine (1982)
Sustaining positive perceptions of science in the face of conflicting health information: An experimental test of messages about the process of scientific discovery
- Research Article
5
- 10.1007/s10686-011-9276-8
- Dec 13, 2011
- Experimental Astronomy
Most workflow systems that support data provenance primarily focus on tracing lineage of data. Data provenance by data lineage provides the derivation history of data including information about services and input data that contributed to the creation of a data product. We show that tracing lineage by means of full backward chaining not only enables users to share, discover and reuse the data, but also supports scientific data processing through storage, retrieval and (re)processing of digitized scientific data. In this paper, we present Astro-WISE, a distributed system for processing, analyzing and disseminating wide field imaging astronomical data. We show how Astro-WISE traces lineage of data and how it facilitates data processing, retrieval, storage and archiving. Particularly we show how it solves issues related to the changing data items typical for the scientific environment, such as physical changes in calibrations, our insight in these changes and improved methods for deriving results.
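The idea of tracing lineage by full backward chaining can be illustrated with a minimal sketch. It assumes a simple in-memory dependency map; the product names and the `trace_lineage` helper are hypothetical and are not taken from Astro-WISE, whose actual data model is not described here.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each data product points to the inputs
# it was derived from. Names are illustrative only.
DERIVED_FROM: Dict[str, List[str]] = {
    "source_catalog": ["calibrated_image"],
    "calibrated_image": ["raw_image", "flat_field", "bias_frame"],
    "flat_field": ["raw_flat", "bias_frame"],
}


def trace_lineage(product: str, derived_from: Dict[str, List[str]]) -> Set[str]:
    """Return every upstream item reachable from `product` by backward chaining."""
    seen: Set[str] = set()
    stack = [product]
    while stack:
        current = stack.pop()
        # Items with no recorded inputs (raw data) simply end the chain.
        for parent in derived_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


lineage = trace_lineage("source_catalog", DERIVED_FROM)
```

Under these assumptions, a change to any item in `lineage` (for example a revised calibration frame) would mark `source_catalog` as a candidate for reprocessing.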
- Conference Article
17
- 10.1109/nbis.2009.48
- Aug 1, 2009
Most workflow systems that support data provenance primarily focus on tracing lineage of data. Data provenance by data lineage provides the derivation history of data including information about services and input data that contributed to the creation of a data product. We show that tracing lineage by means of full backward chaining not only enables users to share, discover and reuse the data, but also supports scientific data processing through storage, retrieval and (re)processing of digitized scientific data. In this paper, we present Astro-WISE, a distributed system for processing, analyzing and disseminating wide field imaging astronomical data. We show how Astro-WISE traces lineage of data and how it facilitates data processing, retrieval, storage and archiving. Particularly we show how it solves issues related to the changing data items typical for the scientific environment, such as physical changes in calibrations, our insight in these changes and improved methods for deriving results.
- Conference Article
3
- 10.36334/modsim.2013.k5.car
- Dec 1, 2013
Large multi-disciplinary scientific projects that inform government policy and have a high public profile are often exposed to high levels of scrutiny. Such projects rely on a range of input datasets and modelling software packages and generate high volumes of output data, which are presented as summarised results in published reports. Defending the scientific integrity of project reporting requires that all project results have demonstrable integrity with clear evidence of the workflows and processes used to generate them, i.e. they must implement structured data management including provenance capture and storage. Provenance data capture forms part of effective data management. The reporting of data provenance needs to occur in all workflows within a project and crucially needs support from project management, and adoption by project staff so that provenance chains are unbroken at every step, thus providing demonstrable integrity. Even when project funds and milestones are allocated to provenance tasks, such as ensuring staff store project datasets in managed locations and generate standardised dataset metadata records, data provenance capture has often been poor. This indicates that the barrier to the adoption of useful data provenance tasks is still significant. The development and application of automated systems, which capture and report provenance without additional user effort, are therefore of critical importance in helping to lower this barrier thus easing cultural change in data management. Even if a project or organisation has motivation, has made the case, established a vision, and developed plans to implement provenance management, buy-in from all project staff is still required for success. This is because provenance chains containing information about data lifecycles need to be unbroken for all results, thus requiring involvement from all project staff. Some, perhaps the majority, of project processes cannot be automated, thus they will require significant manual effort in order to be included in provenance management. This paper outlines previous best-practice regarding CSIRO's data management approach as demonstrated by the Murray Darling Basin Sustainable Yields project, and reflects on their shortcomings, such as the lack of adequate provenance capture, with improvements suggested. It then describes several automated provenance management tools that employ semantic web technologies and preserve the identity of provenance reports and datasets; which may be used to help with bottom-up practice adoption. The automated provenance management tools can provide well-defined, automated processes, which may help to lower the barriers preventing cultural change for data management at the project and organisational level. It is hoped that the improved data management practices and the automated tools discussed here can inform current and new high-profile projects, such as the Bioregional Assessments program, to attain a higher quality of demonstrable data integrity through more robust provenance management.
- Research Article
14
- 10.5860/choice.38-2132
- Dec 1, 2000
- Choice Reviews Online
From the Publisher: Wagman offers a critical analysis of current theory and research in the psychological and computational sciences, directed toward the elucidation of scientific discovery processes and structures. It discusses human scientific discovery processes, analyzes computer scientific discovery processes, and makes a comparative evaluation of the two. This work examines the scientific reasoning of the discoverers of the inhibition mechanism of gene control; scientific discovery heuristics used at different developmental levels; artificial intelligence and mathematical discovery; the ECHO system; the evolution of artificial intelligence discovery systems; the PAULI system; and the KEKADA system. It concludes with an examination of the extent to which computational discovery systems can emulate a set of 10 types of scientific problems.
- Research Article
3
- 10.1089/acm.2007.7016
- May 1, 2007
- The Journal of Alternative and Complementary Medicine
Early Phase Research and the Process of Scientific Discovery. Gary E. Schwartz. Editorial, The Journal of Alternative and Complementary Medicine, Vol. 13, No. 4, pp. 399–400. Published online 28 May 2007. https://doi.org/10.1089/acm.2007.7016
- Research Article
29
- 10.5860/choice.48-6899
- Aug 1, 2011
- Choice Reviews Online
Trundling along in essentially the same form for some 220 million years, turtles have seen dinosaurs come and go, mammals emerge, and humankind expand its dominion. Is it any wonder the persistent reptile bested the hare? In this engaging book, physiologist Donald Jackson shares a lifetime of observation of this curious creature, allowing us a look under the shell of an animal at once so familiar and so strange. Here we discover how the turtle's proverbial slowness helps it survive a long, cold winter under ice, how the shell not only serves as a protective home but also influences such essential functions as buoyancy control, breathing, and surviving remarkably long periods without oxygen, and how many other physiological features help define this unique animal. Jackson offers insight into what exactly it's like to live inside a shell - to carry the heavy carapace on land and in water, to breathe without an expandable ribcage, to have sex with all that body armor intervening. Along the way we also learn something about the process of scientific discovery - how the answer to one question leads to new questions, how a chance observation can change the direction of study, and above all how new research always builds on the previous work of others. A clear and informative exposition of physiological concepts using the turtle as a model organism, the book is as interesting for what it tells us about scientific investigation as it is for its deep and detailed understanding of how the enduring turtle 'works'.
- Book Chapter
1
- 10.1007/978-3-319-70102-8_11
- Jan 1, 2017
Huge amounts of data are being generated by Internet of Things (IoT) devices. Termed as Big Data, this data needs to be reliably stored, extracted, and analyzed. Capturing provenance of such data provides a mechanism to explain the result of data analytics and provides greater trustworthiness to the insights gathered from data analytics. Capturing the provenance of the data stored in NoSQL databases can help to understand how the data reached its current state. A holistic explanation of the results of data analytics can be achieved through the combination of provenance information of the data with results of analytics. This chapter explores the challenges of automatic provenance capture at the middleware level in three different contexts: in an analytics framework like MapReduce, in NoSQL data stores with MapReduce analytic framework, and in NoSQL stores with SQL front ends. The chapter also portrays how the provenance captured in the MapReduce framework is useful for improving the future executions of job reruns and anomaly detection, apart from its use in debugging.
- Book Chapter
78
- 10.1007/3-540-48912-6_65
- Jan 1, 1999
This paper presents some significant fundamental observations and/or assumptions on scientific discovery processes and their automation, shows why classical mathematical logic, its various classical conservative extensions, and traditional (weak) relevant logics cannot satisfactorily underlie epistemic processes in scientific discovery, and presents a strong relevant logic model of epistemic processes in scientific discovery.
Keywords: Scientific Discovery; Epistemic State; Belief Revision; Scientific Reasoning; Programming Paradigm
- Book Chapter
4
- 10.1007/978-3-642-17819-1_35
- Jan 1, 2010
Running scientific workflows in distributed environments is motivating the definition of provenance gathering approaches that are loosely coupled to the workflow execution engine. This kind of approach is interesting because it allows both storage and access to provenance data in an integrated way, even in an environment where different workflow systems work together. Therefore, we have proposed a provenance gathering strategy that is independent from the workflow system technology. This strategy has evolved into a provenance management system named ProvManager. In this paper we show how provenance data is captured in a distributed execution environment with ProvManager and we show its web interface, in which scientists can register experiments, monitor workflow execution, and query provenance data.
- Research Article
11
- 10.3233/jcs-200108
- Apr 27, 2021
- Journal of Computer Security
Data provenance collects comprehensive information about the events and operations in a computer system at both application and kernel levels. It provides a detailed and accurate history of transactions that helps delineate the data flow across the whole system. Data provenance helps achieve system resilience by uncovering malicious attack traces after a system compromise, which the analyzer uses to understand the attack behavior and assess the level of damage. Existing literature demonstrates a number of research efforts on information capture, management, and analysis of data provenance. In recent years, provenance in IoT devices has attracted considerable research effort because of the proliferation of commodity IoT devices. In this survey paper, we present a comparative study of the state-of-the-art approaches to provenance by classifying them based on frameworks, deployed techniques, and subjects of interest. We also discuss the emergence and scope of data provenance in IoT networks. Finally, we highlight several urgent directions that data provenance research needs to pursue, including data management and analysis.
- Research Article
29
- 10.1046/j.1523-1739.2003.01721.x
- Mar 25, 2003
- Conservation Biology
Conservation Science and NGOs
- Research Article
16
- 10.1016/j.compag.2019.01.044
- Feb 28, 2019
- Computers and Electronics in Agriculture
Towards integration of data-driven agronomic experiments with data provenance
- Research Article
1
- 10.5860/choice.50-6154
- Jul 1, 2013
- Choice Reviews Online
The scientific method is one of the most basic and essential concepts across the sciences, ensuring that investigations are carried out with precision and thoroughness. The scientific method is typically taught as a step-by-step approach, but real examples from history are not always given. This book teaches the basic modes of scientific thought, not by philosophical generalizations, but by illustrating in detail how great scientists from across the sciences solved problems using scientific reason. Examples include Christopher Columbus, Joseph Priestley, Antoine Lavoisier, Michael Faraday, Wilhelm Röntgen, Max Planck, Albert Einstein, and Niels Bohr. Written by a successful research physicist who has engaged in many studies and years of research, all in the attempt to extract the secrets of nature, this book captures the excitement and joy of research. The process of scientific discovery is as delightfully absorbing, as complex, and as profoundly human as falling in love. It can be a roller coaster ride of despairing valleys and exhilarating highs. This book sketches the powerful reasoning that led to many different discoveries, but also celebrates the ah-ha moments experienced by each scientist, letting readers share the thrilling instant when each scientist reached the critical revelation in his research.
* Places the scientific method in context using historical examples
* Suitable for both scientists and non-scientists looking to better understand scientific reasoning
* Written in an engaging style with clear illustrations and referencing
- Conference Article
40
- 10.1109/works49585.2019.00006
- Oct 28, 2019
Machine Learning (ML) has become essential in several industries. In Computational Science and Engineering (CSE), the complexity of the ML lifecycle comes from the large variety of data, scientists' expertise, tools, and workflows. If data are not tracked properly during the lifecycle, it becomes unfeasible to recreate an ML model from scratch or to explain to stakeholders how it was created. The main limitation of provenance tracking solutions is that they cannot cope with provenance capture and integration of domain and ML data processed in the multiple workflows in the lifecycle, while keeping the provenance capture overhead low. To handle this problem, in this paper we contribute a detailed characterization of provenance data in the ML lifecycle in CSE; a new provenance data representation, called PROV-ML, built on top of W3C PROV and ML Schema; and extensions to a system that tracks provenance from multiple workflows to address the characteristics of ML and CSE, and to allow for provenance queries with a standard vocabulary. We show a practical use in a real case in the O&G industry, along with its evaluation using 239,616 CUDA cores in parallel.