Abstract

Scientific workflows have been used as a programming model to automate scientific tasks ranging from short pipelines to complex workflows that span heterogeneous data and computing resources. While the use of scientific workflow technologies varies across scientific disciplines, workflow systems offer informatics and computational science disciplines a common set of attributes that facilitate and accelerate workflow-driven research. Scientific workflows make it easy to assemble complex processing in local or distributed environments via rich and expressive programming models. Scientific workflows enable transparent access to diverse resources, ranging from local clusters and traditional supercomputers to elastic and heterogeneous cloud resources. Scientific workflows support the incorporation of multiple software tools, ranging from domain-specific tools for standard processing to custom generalized workflows and middleware tools that can be reused in various contexts. Scientific workflows often collect provenance information on workflow entities, e.g., workflow definitions, their executions, and runtime parameters, and in turn assure a level of reproducibility while enabling results to be referenced and replicated. In doing so, scientific workflows often foster an open-source, open-access, and standards-driven community development model based on sharing and collaboration. Cyberinfrastructure platforms and gateways commonly employ scientific workflows to bridge the gap between the infrastructure and users' needs. While formally capturing and communicating the scientific process, workflows provide flexibility, foster synergy among users, optimize resource usage, increase reuse, and ensure compliance with system-specific data models and community-driven standards. Currently, scientific workflows are widely used in the life sciences at different stages of the end-to-end data lifecycle, from the generation of biological data to its analysis and publication. The data handled by such workflows can be produced by sequencers, sensor networks, medical imaging instruments, and other heterogeneous resources at significant rates and decreasing costs, making the analysis and archival of such data a 'big data' challenge. Additionally, these new biological data resources are enabling new and exciting research in areas such as metagenomics and personalized medicine. However, the analysis of big biological data is still very costly, requiring new scalable computational models and programming paradigms to be applied to biological analysis. Although some new paradigms exist for the analysis of big data, the application of these best practices to the life sciences is still in its infancy. Scientific workflows can act as a scaffold and help speed up this process by combining existing programming and computational models with the challenges of biological problems as reusable building blocks. In this talk, I will present such an approach that builds upon distributed data-parallel patterns, e.g., MapReduce, and underlying execution engines, e.g., Hadoop, and matches the computational requirements of bioinformatics tools with such patterns and engines. The presented approach is developed as part of the bioKepler (bioKepler.org) module and can be downloaded to work with release 2.4 of the Kepler scientific workflow system (kepler-project.org).
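As a rough, hypothetical sketch of the data-parallel pattern described above (not bioKepler's actual actors or its Hadoop integration), the Python code below splits a sequence file into chunks, maps a command-line bioinformatics tool over the chunks in parallel, and concatenates the partial outputs. The file names, chunk size, and the blastn invocation are illustrative assumptions only.

    # Minimal sketch of the distributed data-parallel (MapReduce-style) pattern,
    # using Python's multiprocessing as a stand-in for an engine such as Hadoop.
    # File names, chunk size, and the blastn command are illustrative assumptions,
    # not part of bioKepler's actual API.
    import subprocess
    import tempfile
    from multiprocessing import Pool

    def split_fasta(path, chunk_size=100):
        """Split a FASTA file into chunks of `chunk_size` records (the 'map' inputs)."""
        chunk, count = [], 0
        with open(path) as handle:
            for line in handle:
                if line.startswith(">"):
                    if count == chunk_size:
                        yield "".join(chunk)
                        chunk, count = [], 0
                    count += 1
                chunk.append(line)
        if chunk:
            yield "".join(chunk)

    def run_tool(records):
        """Run a command-line bioinformatics tool on one chunk (the 'map' step)."""
        with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
            tmp.write(records)
            query = tmp.name
        # Hypothetical tool invocation; in a workflow system this would be an actor/task.
        result = subprocess.run(
            ["blastn", "-query", query, "-db", "nt", "-outfmt", "6"],
            capture_output=True, text=True)
        return result.stdout

    if __name__ == "__main__":
        chunks = list(split_fasta("sequences.fasta"))
        with Pool() as pool:
            partial_results = pool.map(run_tool, chunks)   # data-parallel map
        merged = "".join(partial_results)                  # simple 'reduce': concatenate
        with open("results.tsv", "w") as out:
            out.write(merged)

In bioKepler, the analogous split, parallel-execute, and merge steps are expressed as reusable workflow components rather than ad hoc scripts, allowing the same pattern to be retargeted to different execution engines.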
