Abstract

Information integration and workflow technologies for data analysis have long been major fields of investigation in bioinformatics. A range of popular workflow suites are available to support analyses in computational biology. Commercial providers tend to offer prepared applications hosted remotely from their clients. However, for most academic environments with local expertise, novel data collection techniques or novel data analyses, it is essential to have all the flexibility of open-source tools and open-source workflow descriptions. Workflows in data-driven science such as computational biology have grown considerably in complexity. New tools or new releases with additional features arrive at an enormous pace, and new reference data and concepts for quality control are emerging. A well-abstracted workflow, and its exchange across working groups, have an enormous impact on the efficiency of research and the further development of the field. High-throughput sequencing adds to the avalanche of data available in the field; efficient computation and, in particular, parallel execution motivate the transition from traditional scripts and Makefiles to workflows. Here we review the current software development and distribution model, with a focus on the role of integration testing, and discuss the effect of the Common Workflow Language on distributions of open-source scientific software, which must swiftly and reliably provide the tools demanded for executing such formally described workflows. We contend that, freed from technical differences in execution on local machines, clusters or the cloud, communities also gain the technical means to test workflow-driven interaction across several software packages.
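To make the notion of a formally described workflow concrete, the following is a minimal sketch of a two-step pipeline in the Common Workflow Language; the tool descriptions trim.cwl and count.cwl, and all input and output names, are hypothetical placeholders rather than material from the article.

    #!/usr/bin/env cwl-runner
    # Minimal CWL workflow sketch (hypothetical example):
    # step "trim" cleans raw sequencing reads, step "count" counts the survivors.
    cwlVersion: v1.0
    class: Workflow
    inputs:
      raw_reads: File              # e.g. a FASTQ file from a public archive
    outputs:
      read_count:
        type: File
        outputSource: count/counted    # expose the second step's result
    steps:
      trim:
        run: trim.cwl              # hypothetical wrapper around a read trimmer
        in:
          reads: raw_reads
        out: [trimmed]
      count:
        run: count.cwl             # hypothetical wrapper around a counting tool
        in:
          reads: trim/trimmed
        out: [counted]

A CWL engine derives the dependency between the two steps from the dataflow itself, so steps without mutual dependencies can be dispatched in parallel on a cluster or in the cloud without any change to the description.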

Highlights

  • An enormous amount of data is available in public databases, institutional data archives or generated locally

  • This remote wealth is immediately downloadable, but its interpretation is hampered by the variation of samples and their biomedical condition, the technological preparation of the sample and data formats

  • All sciences are challenged with data management, and particle physics, astronomy, medicine and biology are known for data-driven research

Introduction

An enormous amount of data is available in public databases and institutional data archives or is generated locally. This remote wealth is immediately downloadable, but its interpretation is hampered by the variation of samples and their biomedical condition, the technological preparation of the sample and data formats. All sciences are challenged with data management, and particle physics, astronomy, medicine and biology are known for data-driven research. The influx of data further increases with ongoing technical advances and higher acceptance in the community. Local compute facilities grow and have become extensible by public clouds, all of which need to be maintained and for which the scientific execution environment must be prepared.
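As a sketch of how such an execution environment can be prepared declaratively rather than by hand, the hypothetical CWL tool description below pins its software to a container image; the FastQC wrapper and the image tag are illustrative assumptions, not taken from the article.

    #!/usr/bin/env cwl-runner
    # Hypothetical tool description: the execution environment is declared
    # as a container image, so local machines, clusters and clouds need no
    # manual software installation.
    cwlVersion: v1.0
    class: CommandLineTool
    requirements:
      - class: DockerRequirement
        dockerPull: quay.io/biocontainers/fastqc:0.11.9--0   # assumed image tag
    baseCommand: fastqc
    arguments: ["--outdir", "."]   # write the report into the working directory
    inputs:
      reads:
        type: File
        inputBinding:
          position: 1
    outputs:
      report:
        type: File
        outputBinding:
          glob: "*_fastqc.html"    # FastQC emits an HTML report per input file

Any conformant engine can then run the same description unchanged, for example cwltool fastqc.cwl --reads sample.fastq on a local machine.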
