Abstract

Cloud computing offers the massive scalability and elasticity required by many scientific and commercial applications. Combining the computational and data handling capabilities of clouds with parallel processing also has the potential to tackle Big Data problems efficiently. Science gateway frameworks and workflow systems enable application developers to implement complex applications and make them available to end-users via simple graphical user interfaces. The integration of such frameworks with Big Data processing tools on the cloud opens new opportunities for application developers. This paper investigates how workflow systems and science gateways can be extended with Big Data processing capabilities. A generic approach based on infrastructure-aware workflows is suggested, and a proof of concept is implemented on top of the WS-PGRADE/gUSE science gateway framework by integrating it with Hadoop, a parallel data processing solution based on the MapReduce paradigm, in the cloud. The provided analysis demonstrates that the described methods for integrating Big Data processing with workflows and science gateways work well in different cloud infrastructures and application scenarios, and can be used to create massively parallel applications for the scientific analysis of Big Data.

Highlights

  • Cloud computing is a new and emerging computing paradigm that has the potential to completely change the way commercial and scientific applications are deployed, hosted and executed

  • This paper describes a generic approach based on infrastructure-aware workflows [22] for integrating workflow-based science gateway frameworks with Big Data processing

  • An application was deployed on the CloudBroker Platform, which installed Hadoop 2.7.1 on an Ubuntu 14.04 (Trusty) server and saved the instance as a snapshot, used both for submitting jobs and for launching the nodes of the Hadoop cluster (a sketch of interacting with such a cluster follows this list)
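
As a minimal sketch (not taken from the paper), the snippet below shows how a client could stage input data into HDFS on such a snapshot-launched Hadoop 2.7.x cluster before submitting a job. The host name hadoop-master and port 9000 are placeholder assumptions for the cluster's name node, not details reported by the authors.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsStagingSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder: the name node of the cluster launched from the
            // Hadoop 2.7.1 snapshot; the real address depends on the deployment.
            conf.set("fs.defaultFS", "hdfs://hadoop-master:9000");
            try (FileSystem fs = FileSystem.get(conf)) {
                // Stage an input file into HDFS before submitting a job.
                Path input = new Path("/user/gateway/input/data.txt");
                try (FSDataOutputStream out = fs.create(input, true)) {
                    out.write("sample input line\n".getBytes(StandardCharsets.UTF_8));
                }
                // Read it back to confirm the cluster is reachable.
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(fs.open(input), StandardCharsets.UTF_8))) {
                    System.out.println(in.readLine());
                }
            }
        }
    }

Staging data through the FileSystem API in this way mirrors what a gateway-side job wrapper would typically need to do before launching a MapReduce job on the cluster.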

Introduction

Cloud computing is a new and emerging computing paradigm that has the potential to completely change the way commercial and scientific applications are deployed, hosted and executed. New parallel approaches and algorithms are constantly being proposed; one such example is Apache Hadoop [1], an open-source implementation of the MapReduce framework [2] introduced by Google in 2004. The investigation and results presented in this paper focus on the extension of science gateway frameworks and grid/cloud workflow systems with Big Data handling and MapReduce-based parallelism. This section briefly describes these baseline technologies, which enable the execution and sharing of scientific workflows in a cloud computing environment and provide the basis for the cloud-based Big Data integration. gUSE provides WS-PGRADE, a Liferay-based portal to create and execute scientific workflows on various Distributed Computing Infrastructures (DCIs), including clusters, grids and clouds. End-users can import applications available in application repositories, configure them with their own input files and parameters, and run them on the infrastructure of their choice.
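
To make the MapReduce paradigm concrete, the canonical WordCount example against the Hadoop 2.x MapReduce API is sketched below; it is an illustration, not code from the paper. The map phase emits a (word, 1) pair per token, and the reduce phase sums the counts per word; input and output HDFS paths are passed as command-line arguments.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts gathered for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            // The reducer doubles as a combiner because summation is
            // associative and commutative.
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged as a JAR, such an application is submitted with hadoop jar wordcount.jar WordCount <input> <output>, which is also the granularity at which a gateway-side job wrapper can invoke it.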
