Abstract

Sharing scientific analyses via workflows has the potential to improve the reproducibility of research results, as workflows allow complex tasks to be split into smaller pieces and provide visual access to the flow of data between the components of an analysis. This is particularly useful for trans-disciplinary research fields such as biodiversity and ecosystem functioning (BEF), where complex syntheses integrate data over large temporal, spatial and taxonomic scales. However, depending on the data used and the complexity of the analysis, scientific workflows can grow very complex, which makes them hard to understand and reuse. Here we argue that fostering simplicity from the beginning of the data life cycle, by adhering to good practices of data management, can significantly reduce the overall complexity of scientific workflows. It can simplify the processes of data inclusion, cleaning, merging and imputation. To illustrate our points, we chose a typical analysis in BEF research: the aggregation of carbon pools in a forest ecosystem. We propose indicators to measure the complexity of workflow components, including the data sources. We show that complexity decreases exponentially during the course of the analysis and that simple text-based measures can help to identify bottlenecks in a workflow. Taken together, we argue that focusing on the simplification of data sources and workflow components will accelerate data and workflow reuse and improve the reproducibility of data-driven sciences.
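
To make the case study concrete, the following minimal sketch, which is ours and not taken from the paper, shows what an aggregation of forest carbon pools can look like. The column names, pool categories and values are hypothetical placeholders; the imputation and aggregation steps only illustrate the kind of processing the abstract refers to.

    # Illustrative sketch only, not the analysis from the paper: aggregating
    # carbon pools of a forest ecosystem from plot-level measurements.
    # Column names, pool categories and values are hypothetical placeholders.
    import pandas as pd

    # One row per plot and carbon pool (trees, deadwood, soil), carbon stock
    # in megagrams of carbon per hectare; one value is missing on purpose.
    measurements = pd.DataFrame({
        "plot": ["P1", "P1", "P1", "P2", "P2", "P2"],
        "pool": ["trees", "deadwood", "soil"] * 2,
        "carbon_mg_ha": [120.4, 8.7, 95.2, 98.1, None, 101.6],
    })

    # Imputation: fill a missing value with the mean of the same pool.
    measurements["carbon_mg_ha"] = measurements.groupby("pool")["carbon_mg_ha"] \
        .transform(lambda s: s.fillna(s.mean()))

    # Aggregation: total ecosystem carbon per plot.
    total_per_plot = measurements.groupby("plot")["carbon_mg_ha"].sum()
    print(total_per_plot)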

Highlights

  • Interdisciplinary approaches, new tools and technologies, and the increasing availability of online accessible data have changed the way researchers pose questions and perform analyses[1]

  • Workflow software enables access to distributed web services providing data[1], and enables automation of the repetitive tasks that occur in every scientific analysis

  • We show that workflow complexity and data usage of a typical analysis in biodiversity and ecosystem functioning (BEF) can be quantified using relatively simple qualitative and quantitative measures based on commands, code lines, and variable numbers (a minimal sketch of such measures follows this list)
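
As a rough illustration of such text-based measures, the sketch below counts code lines, commands (function calls) and distinct assigned variables in a script file. The counting heuristics and the file names at the end are our own placeholders, not the indicators defined in the paper.

    # Rough sketch of text-based complexity indicators for a workflow
    # component: counts of code lines, commands (function calls) and distinct
    # variable names in a script. The heuristics target R-style scripts and
    # are approximations, not the indicators proposed in the paper.
    import re
    from pathlib import Path

    def complexity_indicators(script_path):
        text = Path(script_path).read_text()
        # Non-empty, non-comment lines as a proxy for lines of code.
        code_lines = [ln for ln in text.splitlines()
                      if ln.strip() and not ln.strip().startswith("#")]
        # Function calls such as read.csv( or merge( as a proxy for commands.
        commands = re.findall(r"\b[\w.]+\s*\(", text)
        # Left-hand sides of assignments (<- or =) as a proxy for variables.
        variables = set(re.findall(r"^\s*([\w.]+)\s*(?:<-|=(?!=))",
                                   text, flags=re.M))
        return {"code_lines": len(code_lines),
                "commands": len(commands),
                "variables": len(variables)}

    # Hypothetical components of a carbon-pool workflow, one script per step.
    for step in ["01_import.R", "02_clean.R", "03_merge.R", "04_aggregate.R"]:
        print(step, complexity_indicators(step))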


Introduction

Interdisciplinary approaches, new tools and technologies, and the increasing availability of online accessible data have changed the way researchers pose questions and perform analyses[1]. Workflow software enables access to distributed web services providing data[1], and enables automation of the repetitive tasks that occur in every scientific analysis. Workflow tools such as Kepler or Pegasus help to break down complex tasks into smaller pieces[2,3]. However, an increase in the complexity of the analyses and datasets packed into workflows can render them difficult to understand and to reuse. This is true for the “long tail” of big data[4], consisting of small and highly heterogeneous files that do not result from automated loggers but from scientific experiments, observations, or interviews. There is a lack of papers that discuss the workflow components of an analysis, including data processing.
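
As a minimal, engine-agnostic illustration of these two capabilities, the sketch below fetches a table from a data web service and automates a repetitive cleaning step. The URL and the column name are placeholders of our own, not a real service and not part of the cited workflow tools.

    # Minimal sketch, independent of any particular workflow engine: fetch a
    # table from a data web service and automate a repetitive cleaning step.
    # The URL and the column name are hypothetical placeholders.
    import csv
    import io
    import urllib.request

    DATA_SERVICE = "https://example.org/bef/plots.csv"  # placeholder endpoint

    def fetch_table(url):
        # Download a CSV table from the (placeholder) web service.
        with urllib.request.urlopen(url) as response:
            reader = csv.DictReader(io.TextIOWrapper(response, encoding="utf-8"))
            return list(reader)

    def drop_incomplete(rows, column):
        # The kind of repetitive task that is worth automating for every dataset.
        return [row for row in rows if (row.get(column) or "").strip()]

    if __name__ == "__main__":
        rows = drop_incomplete(fetch_table(DATA_SERVICE), column="carbon_stock")
        print(f"{len(rows)} usable records")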

