Abstract

Many techniques, such as scheduling and resource provisioning, rely on performance predictions of workflow tasks for varying input data. However, such estimates are difficult to generate in the cloud. This paper introduces a novel two-stage machine learning approach for predicting workflow task execution times for varying input data in the cloud. To achieve high-accuracy predictions, our approach relies on parameters reflecting runtime information and on two stages of predictions. Empirical results for four real-world workflow applications and several commercial cloud providers demonstrate that our approach outperforms existing prediction methods. In our experiments, our approach achieves a best-case estimation error of 1.6 percent and a worst-case error of 12.2 percent, while existing methods produced errors beyond 20 percent (in some cases even over 50 percent) for more than 75 percent of the evaluated workflow tasks. In addition, we show that the models our approach builds for a specific cloud can be ported to new clouds with low effort and low error, requiring only a small number of executions on the new cloud.
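
To make the two-stage idea concrete, the following is a minimal sketch, not the paper's exact models or parameter set: it assumes scikit-learn regressors, and the feature names and data are invented. Stage one estimates runtime parameters that are unknown before a task executes; stage two predicts the execution time from the input-data features combined with those estimates.

```python
# Minimal sketch of the abstract's two-stage idea -- NOT the paper's exact
# models or parameter set. Assumes scikit-learn; feature names are invented.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X_input = rng.random((200, 3))          # features of the task's input data (hypothetical)
runtime_params = rng.random((200, 2))   # parameters observed at runtime, e.g. bytes read
exec_time = rng.random(200)             # measured execution times (placeholder values)

# Stage 1: estimate the runtime parameters, which are unknown before the
# task runs, from the input-data features alone.
stage1 = RandomForestRegressor(n_estimators=100, random_state=0)
stage1.fit(X_input, runtime_params)

# Stage 2: predict execution time from the input features combined with
# the stage-1 estimates of the runtime parameters.
X_stage2 = np.hstack([X_input, stage1.predict(X_input)])
stage2 = RandomForestRegressor(n_estimators=100, random_state=0)
stage2.fit(X_stage2, exec_time)

# Prediction for unseen input data chains both stages.
X_new = rng.random((5, 3))
t_pred = stage2.predict(np.hstack([X_new, stage1.predict(X_new)]))
```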

Highlights

  • The cloud computing paradigm offers various advantages for scientific applications, including rapid provisioning of resources, pay-per-use pricing, and elastic scaling of resources

  • We propose a performance prediction method, falling into the first category of analytical modeling, to predict the execution time of workflow tasks in clouds

  • We consider two ensemble methods, Bagging [27] and Random Forest [3]; the former has already been applied to performance prediction in the cloud [17] (a sketch comparing the two follows this list)
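
As a companion to the last highlight, the sketch below contrasts the two ensemble methods on synthetic data. It assumes the scikit-learn implementations of Bagging and Random Forest; the paper does not prescribe a particular library, and the data here is invented.

```python
# Comparing the two ensemble methods named in the highlights on synthetic
# data. Assumes scikit-learn; the paper does not prescribe a library.
import numpy as np
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((300, 4))                                            # hypothetical task features
y = X @ np.array([2.0, 0.5, 1.0, 3.0]) + rng.normal(0, 0.1, 300)   # synthetic runtimes

# Bagging: bootstrap-aggregated base regressors (decision trees by default).
bagging = BaggingRegressor(n_estimators=100, random_state=0)
# Random Forest: bagging plus random feature selection at each tree split.
forest = RandomForestRegressor(n_estimators=100, random_state=0)

for name, model in [("Bagging", bagging), ("Random Forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(f"{name}: MAE = {-scores.mean():.3f}")
```

Random Forest extends plain bagging by also randomizing the features considered at each split, which further decorrelates the ensemble members.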

Summary

Introduction

The cloud computing paradigm offers various advantages for scientific applications, including rapid provisioning of resources, pay-per-use pricing, and elastic scaling of resources. Workflow applications [1] consist of a possibly large number of components, known as workflow tasks, such as legacy programs, data analysis or computational methods, complex simulations, or even smaller subworkflows. These components are connected by data and control flow dependencies. A crucial aspect for scientific workflows is the effective optimization of runtimes, resource usage, and economic costs. These goals can be achieved through different techniques, in particular scheduling, which determines the resource on which to execute each workflow task, and resource provisioning, which determines how many resources of which type are needed [2]. Since cloud infrastructures offer a wide variety of computing resources, execution times may only be known for a subset of cloud providers and for a restricted set of workflow input data.
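
To illustrate how such predictions could feed the scheduling and provisioning decisions mentioned above, consider the following hedged sketch: given predicted execution times of a task on several VM types, choose the cheapest type that still meets a deadline. The VM names, prices, and deadline are invented for illustration; the paper's cited scheduling techniques [2] are not reproduced here.

```python
# Illustrative only: how per-task runtime predictions could feed a simple
# provisioning decision. VM types, prices, and the deadline are invented.
predicted_runtime_s = {"m5.large": 840.0, "m5.xlarge": 430.0, "m5.2xlarge": 230.0}
price_per_hour = {"m5.large": 0.096, "m5.xlarge": 0.192, "m5.2xlarge": 0.384}
deadline_s = 600.0

def cheapest_within_deadline(runtimes, prices, deadline):
    """Return the lowest-cost VM type whose predicted runtime meets the deadline."""
    cost = {vm: prices[vm] * runtimes[vm] / 3600.0
            for vm in runtimes if runtimes[vm] <= deadline}
    if not cost:
        raise ValueError("no VM type meets the deadline")
    return min(cost, key=cost.get)

print(cheapest_within_deadline(predicted_runtime_s, price_per_hour, deadline_s))
# -> m5.xlarge: it meets the 600 s deadline at lower cost than m5.2xlarge
```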
