Abstract

Amazon spot instances have become a popular way to save cost in the cloud, but they are prone to abrupt termination whenever the spot market price exceeds the bid price. In this paper, spot instances are used in the task instance group of an Amazon Elastic MapReduce (EMR) cluster to process batch jobs with deadlines. Amazon EMR makes it convenient to process Big Data with the aid of the Hadoop framework. However, the intermediate results held on the task nodes of the cluster are lost if the spot instances are terminated, which can delay processing. Cost efficiency can be realized by exploiting the non-real-time nature of batch computing for Big Data. Two algorithms are devised for cost-efficient processing in Hadoop MapReduce. Both algorithms process the data in divisions, so that abrupt termination of spot instances affects only the division being processed at that moment. Progress is measured as the number of completed work divisions and is monitored at given checkpoints; based on this progress, the task group's capacity is resized so that processing completes within the deadline. The first algorithm begins with an initially estimated number of spot instances. To complete processing of all data in time, on-demand instances are deployed after a certain threshold time. The second algorithm starts with a higher number of spot instances than is required to complete the work within the given deadline. It therefore has a higher chance of relying solely on spot instances for the whole execution of the batch job. On-demand instances are deployed only when progress is slow because spot instances were terminated and subsequent bids were unsuccessful. The experiments show that both algorithms reduce the processing cost, and that the second algorithm reduces it further in most cases.
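The checkpoint-based resizing described above can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes that work scales linearly with the number of instances and that one instance processes one division in a known, fixed time. The function names and parameters (`instances_needed`, `resize_at_checkpoint`, `division_time_per_instance`, `running_spot`) are hypothetical and chosen only for illustration.

```python
import math


def instances_needed(remaining_divisions, time_left, division_time_per_instance):
    """Smallest instance count that finishes the remaining divisions in time.

    Assumption (not from the paper): throughput scales linearly, i.e. n
    instances complete n divisions every `division_time_per_instance` units.
    """
    if remaining_divisions <= 0:
        return 0
    if time_left <= 0:
        raise ValueError("deadline already passed with work remaining")
    total_work = remaining_divisions * division_time_per_instance
    return math.ceil(total_work / time_left)


def resize_at_checkpoint(total_divisions, completed_divisions, elapsed,
                         deadline, division_time_per_instance, running_spot):
    """Return (instances needed, on-demand instances to add) at a checkpoint.

    Progress is the number of completed divisions. Any shortfall between the
    required capacity and the spot instances still running is covered with
    on-demand instances in this sketch.
    """
    remaining = total_divisions - completed_divisions
    needed = instances_needed(remaining, deadline - elapsed,
                              division_time_per_instance)
    on_demand = max(0, needed - running_spot)
    return needed, on_demand


if __name__ == "__main__":
    # 100 divisions, 40 done after 60 time units, deadline at 100,
    # 2 time units per division per instance, 2 spot instances still running.
    print(resize_at_checkpoint(100, 40, 60, 100, 2, 2))
```

In this sketch, starting with more spot instances than `instances_needed` returns for the full job corresponds loosely to the second algorithm's over-provisioning: the on-demand count stays at zero unless enough spot capacity is lost.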

