Abstract

Background: Bioinformatic analyses typically proceed as chains of data-processing tasks. A pipeline, or 'workflow', is a well-defined protocol, with a specific structure defined by the topology of data-flow interdependencies, and a particular functionality arising from the data transformations applied at each step. In computer science, the dataflow programming (DFP) paradigm defines software systems constructed in this manner, as networks of message-passing components. Thus, bioinformatic workflows can be naturally mapped onto DFP concepts.

Results: To enable the flexible creation and execution of bioinformatics dataflows, we have written a modular framework for parallel pipelines in Python ('PaPy'). A PaPy workflow is created from re-usable components connected by data-pipes into a directed acyclic graph, which together define nested higher-order map functions. The successive functional transformations of input data are evaluated on flexibly pooled compute resources, either local or remote. Input items are processed in batches of adjustable size, allowing one to tune the trade-off between parallelism and lazy evaluation (memory consumption). An add-on module ('NuBio') facilitates the creation of bioinformatics workflows by providing domain-specific data containers (e.g., for biomolecular sequences, alignments, structures) and functionality (e.g., to parse/write standard file formats).

Conclusions: PaPy offers a modular framework for the creation and deployment of parallel and distributed data-processing workflows. Pipelines derive their functionality from user-written, data-coupled components, so PaPy can also be viewed as a lightweight toolkit for extensible, flow-based bioinformatics data processing. The simplicity and flexibility of distributed PaPy pipelines may help users bridge the gap between traditional desktop/workstation and grid computing. PaPy is freely distributed as open-source Python code at http://muralab.org/PaPy, and includes extensive documentation and annotated usage examples.
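To make the nested-map and batch-size ideas above concrete, the following is a minimal conceptual sketch in plain Python, using only the standard multiprocessing module rather than the PaPy API itself: a two-stage pipeline is expressed as nested lazy map calls, and the chunksize parameter plays the role of PaPy's adjustable batch size.

    # Conceptual sketch only -- plain Python, NOT the PaPy API.
    # A two-stage pipeline expressed as nested, lazily evaluated map calls;
    # 'chunksize' batches input items, trading parallel throughput
    # against memory consumption, as described for PaPy pipelines.
    from multiprocessing import Pool

    def parse(item):
        """Stage 1: e.g. clean up a raw input record."""
        return item.strip()

    def transform(item):
        """Stage 2: e.g. a per-item computation."""
        return item.upper()

    if __name__ == "__main__":
        raw = ["  actg  ", "  ggta  ", "  ttaa  "]
        with Pool(processes=2) as pool:
            stage1 = pool.imap(parse, raw, chunksize=2)         # lazy parallel map
            stage2 = pool.imap(transform, stage1, chunksize=2)  # nested map over stage 1
            print(list(stage2))  # ['ACTG', 'GGTA', 'TTAA']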

Highlights

  • Bioinformatic analyses typically proceed as chains of data-processing tasks

  • Unlike business workflows, which emphasize process modeling, automation and management, and are control-flow oriented [2,3], scientific pipelines emphasize data-flow, and fundamentally consist of chained transformations of collections of data items. This is true in bioinformatics, spurring the recent development of workflow management systems (WMS) to standardize, modularize, and execute in silico analyses

  • Instances of the core Worker class are constructed by wrapping functions, and this can be done in a highly general and flexible manner: a Worker instance can be constructed de novo, from multiple pre-defined functions, from another Worker instance, or as a composition of multiple Worker instances (see the sketch after this list)

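The composition behaviour described in the last highlight can be illustrated with a deliberately simplified mock; this is not PaPy's actual Worker class, whose constructor and calling conventions may differ, but it captures the idea of building one callable from functions or from other Worker instances.

    # Deliberately simplified mock -- NOT PaPy's actual Worker class.
    class Worker:
        """Wraps one or more callables; calling the Worker applies them in order."""
        def __init__(self, *funcs):
            # Flatten nested Workers so a Worker can be built from other Workers.
            self.funcs = []
            for f in funcs:
                self.funcs.extend(f.funcs if isinstance(f, Worker) else [f])

        def __call__(self, item):
            for f in self.funcs:
                item = f(item)
            return item

    strip = Worker(str.strip)                  # de novo, from a single function
    clean = Worker(str.strip, str.upper)       # from multiple pre-defined functions
    combo = Worker(strip, Worker(str.upper))   # as a composition of Worker instances

    print(clean("  actg  "), combo("  actg  "))  # ACTG ACTG

Flattening the wrapped callables keeps composed Workers cheap to call while preserving the left-to-right order of the transformations.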

Introduction

Bioinformatic analyses typically proceed as chains of data-processing tasks. A pipeline, or ‘workflow’, is a well-defined protocol, with a specific structure defined by the topology of data-flow interdependencies, and a particular functionality arising from the data transformations applied at each step. Unlike business workflows, which emphasize process modeling, automation and management, and are control-flow oriented [2,3], scientific pipelines emphasize data-flow, and fundamentally consist of chained transformations of collections of data items. This is true in bioinformatics (see, e.g., [4] and references therein), spurring the recent development of workflow management systems (WMS) to standardize, modularize, and execute in silico analyses. With its emphasis on enabling facile creation of Python-based workflows for data processing (rather than, e.g., web-service (WS) discovery or resource brokerage), PaPy is a task-based tool.
