Data partitioning enables the use of standard SOAP Web Services in genome-scale workflows

Paweł Sztromwasser, Pál Puntervoll, Kjell Petersen

doi:10.1515/jib-2011-163

Open Access

https://doi.org/10.1515/jib-2011-163

Abstract

Biological databases and computational biology tools are provided by research groups around the world and made accessible on the Web. Combining these resources is common practice in bioinformatics, but integrating heterogeneous and often distributed tools and datasets can be challenging. To date, this challenge has commonly been addressed pragmatically, by tedious and error-prone scripting. Recently, however, a more reliable technique has been proposed as the platform that would tie bioinformatics resources together, namely Web Services. In the last decade Web Services have spread widely in bioinformatics and have earned the title of recommended technology. However, in the era of high-throughput experimentation, a major concern regarding Web Services is their ability to handle large-scale data traffic. We propose a stream-like communication pattern for standard SOAP Web Services that enables an efficient flow of large data traffic between a workflow orchestrator and Web Services. We evaluated the data-partitioning strategy by comparing it with typical communication patterns on an example pipeline for genomic sequence annotation. The results show that data partitioning lowers the resource demands of services and increases their throughput, which in turn makes it possible to execute in-silico experiments at genome scale using standard SOAP Web Services and workflows. As a proof of principle, we annotated an RNA-seq dataset using a plain BPEL workflow engine.

Highlights

Combining scientific resources is vital for acquiring a complete picture of a scientific problem, and plays a key role in the process of generating new knowledge
We present the data-partitioning communication pattern for standard SOAP Web Services, which is the main result of this work
The throughput of the AIAO communication pattern is limited by the size of the array that can be sent in a single message
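The contrast between sending the whole input in a single message and the data-partitioning pattern can be illustrated with a minimal Python sketch. This is not the authors' implementation (the paper uses SOAP services orchestrated by a BPEL engine); the names `partition`, `run_partitioned`, and `fake_annotation_service` are hypothetical, and the "service" is a local stand-in for a remote operation. The sketch only shows the idea that the orchestrator sends bounded-size chunks and collects results as they arrive, so that no single message has to carry the entire dataset.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def partition(items: Iterable, chunk_size: int) -> Iterator[List]:
    """Yield consecutive chunks of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def run_partitioned(
    items: Iterable[str],
    service: Callable[[List[str]], List[str]],
    chunk_size: int,
) -> Iterator[str]:
    """Call the service once per chunk and stream the results back.

    Only one chunk is held in memory at a time, so the size of each
    request is bounded by chunk_size rather than by the dataset size.
    """
    for chunk in partition(items, chunk_size):
        yield from service(chunk)


def fake_annotation_service(sequences: List[str]) -> List[str]:
    """Hypothetical stand-in for a remote annotation operation."""
    return [f"{seq}:len={len(seq)}" for seq in sequences]


if __name__ == "__main__":
    reads = ["ACGT", "GG", "T"]
    # Single-message style: the whole array goes out in one call.
    print(fake_annotation_service(reads))
    # Partitioned style: same results, but at most two reads per call.
    print(list(run_partitioned(reads, fake_annotation_service, chunk_size=2)))
```

In this sketch both calls produce the same annotations; the difference is that the partitioned variant never builds a request larger than `chunk_size` items, which is the property the highlights attribute to data partitioning.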

Summary

Introduction

Combining scientific resources is vital for acquiring a complete picture of a scientific problem, and plays a key role in the process of generating new knowledge. A plethora of tools and databases hosted at different sites around the globe are available to the life sciences community, and constitute a vast mine of information. Integrating all these distributed resources creates new perspectives and enables scientists to ask broader questions, but it also poses new challenges. The challenge is often addressed by ad-hoc scripting that tightly couples the required resources. This pragmatic approach is tedious and error-prone, so a more promising method has recently gained significant attention. SOAP Web Services have been proposed as the technology that can connect the distributed, heterogeneous bioinformatics resources [1]. Workflows built from such services can be complex analysis pipelines representing in-silico experiments.

Methods

Results

Discussion

Conclusion