Workflows for microarray data processing in the Kepler environment

Thomas Stropp,Mark Bieda,Bertram Ludäscher,Timothy Mcphillips

doi:10.1186/1471-2105-13-102

Thomas Stropp, Mark Bieda + Show 2 more

Open Access

https://doi.org/10.1186/1471-2105-13-102

Copy DOI

Abstract

BackgroundMicroarray data analysis has been the subject of extensive and ongoing pipeline development due to its complexity, the availability of several options at each analysis step, and the development of new analysis demands, including integration with new data sources. Bioinformatics pipelines are usually custom built for different applications, making them typically difficult to modify, extend and repurpose. Scientific workflow systems are intended to address these issues by providing general-purpose frameworks in which to develop and execute such pipelines. The Kepler workflow environment is a well-established system under continual development that is employed in several areas of scientific research. Kepler provides a flexible graphical interface, featuring clear display of parameter values, for design and modification of workflows. It has capabilities for developing novel computational components in the R, Python, and Java programming languages, all of which are widely used for bioinformatics algorithm development, along with capabilities for invoking external applications and using web services.ResultsWe developed a series of fully functional bioinformatics pipelines addressing common tasks in microarray processing in the Kepler workflow environment. These pipelines consist of a set of tools for GFF file processing of NimbleGen chromatin immunoprecipitation on microarray (ChIP-chip) datasets and more comprehensive workflows for Affymetrix gene expression microarray bioinformatics and basic primer design for PCR experiments, which are often used to validate microarray results. Although functional in themselves, these workflows can be easily customized, extended, or repurposed to match the needs of specific projects and are designed to be a toolkit and starting point for specific applications. These workflows illustrate a workflow programming paradigm focusing on local resources (programs and data) and therefore are close to traditional shell scripting or R/BioConductor scripting approaches to pipeline design. Finally, we suggest that microarray data processing task workflows may provide a basis for future example-based comparison of different workflow systems.ConclusionsWe provide a set of tools and complete workflows for microarray data analysis in the Kepler environment, which has the advantages of offering graphical, clear display of conceptual steps and parameters and the ability to easily integrate other resources such as remote data and web services.

Highlights

Microarray data analysis has been the subject of extensive and ongoing pipeline development due to its complexity, the availability of several options at each analysis step, and the development of new analysis demands, including integration with new data sources
It has long been recognized that development of bioinformatics pipelines is a good way to organize complex analyses [5]
The primary goal of this work is to provide fully functional workflows for critical microarray tasks in the Kepler environment, a well-supported workflow system used in other areas of scientific research

Summary

Introduction

Microarray data analysis has been the subject of extensive and ongoing pipeline development due to its complexity, the availability of several options at each analysis step, and the development of new analysis demands, including integration with new data sources. Kepler provides a flexible graphical interface, featuring clear display of parameter values, for design and modification of workflows It has capabilities for developing novel computational components in the R, Python, and Java programming languages, all of which are widely used for bioinformatics algorithm development, along with capabilities for invoking external applications and using web services. There have been dimensional increases in ‘omics datasets, including introduction of new types of data, increases in the size of individual datasets, increases in the varieties of experimental platforms within a given data type (e.g. more varieties of gene expression microarrays), and very large increases in the total number of datasets being generated. These approaches often yield pipelines that are difficult to understand, modify or extend

Objectives

Results

Discussion

Conclusion