Efficient deadline-aware scheduling for the analysis of Big Data streams in public Cloud

Mahmood Mortazavi-Dehkordi, Kamran Zamanifar

doi:10.1007/s10586-019-02908-2

Abstract

The emergence of Big Data has had a profound impact on how data are analyzed. Open source distributed stream processing platforms have gained popularity for analyzing streaming Big Data because they provide the low latency that streaming Big Data applications require while running on Cloud resources. However, existing resource schedulers still lack the efficiency and deadline adherence that Big Data analytical applications require. Recent works have considered the characteristics of streaming Big Data to improve scheduling efficiency and the likelihood of meeting deadlines on these platforms. Nevertheless, they have not taken into account the specific attributes of analytical applications, the utilization cost of the public Cloud, or the delays caused by performance degradation of leased public Cloud resources. This study therefore presents BCframework, an efficient deadline-aware scheduling framework for streaming Big Data analysis applications that run on public Cloud resources. BCframework proposes a scheduling model that considers the public Cloud utilization cost, performance variation, deadline meeting and latency reduction requirements of streaming Big Data analytical applications. Furthermore, it introduces two operator scheduling algorithms based on a novel partitioning algorithm and an operator replication method. BCframework is highly adaptable to the fluctuation of streaming Big Data and the performance degradation of public Cloud resources. Experiments with benchmark and real-world queries show that BCframework can significantly reduce latency and utilization cost and also minimize deadline violations and the number of provisioned virtual machine instances.

Highlights

The emergence of data-intensive applications has rapidly increased the volume, variety and velocity of the data generated during its lifecycle, which represents a major challenge for many organizations and is known as the Big Data problem [1].
BCframework performance is assessed by analyzing the average tuple latency, the utilization cost, the number of deadline misses and the number of provisioned VM instances for the benchmark and real-world queries in the presence of different Bs fluctuation scenarios.
The performance of BCframework is compared with the Storm default scheduler under simple and complex queries because Storm is one of the most popular Big Data stream computing platforms in both academia and industry [30].

Summary

Introduction

The emergence of data-intensive applications has rapidly increased the volume, variety and velocity of the data generated during its lifecycle, which represents a major challenge for many organizations and is known as the Big Data problem [1]. Analysis of streaming Big Data is the last and most important stage of the streaming Big Data lifecycle. Public Cloud is an appropriate infrastructure to host the queries because it operates on a pay-per-use model and can scale resources dynamically in response to the fluctuating resource demand of the queries. This Cloud utilization model is called Infrastructure as a Service (IaaS) and can be used by Cloud-based stream computing platforms to schedule accepted soft deadline-constrained queries on the leased resources. Performance fluctuation of the leased resources can delay the execution of the query operators, and if the delayed operators are part of the query's critical path, a deadline violation can occur.
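The link between operator delay, the critical path and a deadline violation can be illustrated with a small sketch. It is not BCframework's scheduling algorithm: it only assumes that a query is a directed acyclic graph of operators, that each operator has a fixed processing latency, and that performance degradation multiplies an operator's latency by a slowdown factor. The operator names, latencies, deadline and helper functions below are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(latency_ms, preds):
    """Return (total latency, path) of the longest path through an operator DAG.

    latency_ms: operator name -> processing latency in ms (hypothetical values)
    preds:      operator name -> list of upstream operators
    """
    graph = {op: preds.get(op, []) for op in latency_ms}
    finish = {}  # earliest finish time of each operator
    parent = {}  # upstream operator that determines that finish time
    for op in TopologicalSorter(graph).static_order():
        slowest = max(graph[op], key=lambda p: finish[p], default=None)
        parent[op] = slowest
        start = finish[slowest] if slowest is not None else 0.0
        finish[op] = start + latency_ms[op]
    # Walk back from the operator that finishes last.
    node = max(finish, key=finish.get)
    total, path = finish[node], []
    while node is not None:
        path.append(node)
        node = parent[node]
    return total, path[::-1]


def slow_down(latency_ms, op, factor):
    """Copy the latency table with one operator degraded by `factor`."""
    degraded = dict(latency_ms)
    degraded[op] = degraded[op] * factor
    return degraded


# Hypothetical query: source feeds parse and filter, both feed join, then sink.
latency = {"source": 5, "parse": 10, "filter": 8, "join": 20, "sink": 4}
preds = {"parse": ["source"], "filter": ["source"],
         "join": ["parse", "filter"], "sink": ["join"]}
deadline_ms = 45  # assumed soft deadline

baseline, baseline_path = critical_path(latency, preds)
off_path, _ = critical_path(slow_down(latency, "filter", 1.5), preds)
on_path, _ = critical_path(slow_down(latency, "parse", 2.0), preds)

print(baseline, baseline_path)   # 39.0 via source, parse, join, sink
print(off_path <= deadline_ms)   # True: 41.0 ms, slack absorbs the delay
print(on_path <= deadline_ms)    # False: 49.0 ms, the deadline is missed
```

In this toy query, slowing `filter` by 50% stays within the deadline because the slack on the critical path absorbs the delay, whereas doubling the latency of `parse`, which lies on the critical path, pushes the end-to-end latency past the deadline.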

Results

Discussion

Conclusion