Abstract

As businesses increasingly rely on cloud-based big data analytics services, reducing the cost of storing and analyzing large volumes of data in the cloud has become a major concern. Big data analysis jobs generate intermediate data during execution, some of which can be reused by subsequent jobs. Storing such intermediate data can greatly reduce the cost of running big data jobs for businesses using cloud services. The key challenge is determining which data to store in order to save cost. Existing storage strategies do not differentiate between data with different usage frequencies, which leads to significant storage costs in practice. To address this challenge, we propose two online algorithms, one deterministic and one randomized, that dynamically decide whether to store data with the aim of saving cost. We show that the deterministic (resp., randomized) algorithm incurs a cost within a factor of 2 − α′ (resp., 2/(1 + α′)) of the minimum cost achieved by an optimal offline algorithm that is assumed to know the exact future a priori. Finally, through extensive experiments with real-world workloads of big data jobs in the Alibaba Cloud environment, we demonstrate that the proposed online algorithms achieve significant cost savings under common cloud pricing schemes.
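
The abstract does not spell out the two algorithms, so the following is only an illustrative sketch of the store-or-recompute trade-off it describes, not the paper's method. It uses the classic break-even (ski-rental style) rule as a stand-in: after each use, keep the data stored until the storage spend since that use equals the cost of regenerating it, then discard it. The cost model (a flat storage rate and a fixed recompute cost) and all names, including `break_even_cost`, `offline_optimal_cost`, `storage_rate`, and `recompute_cost`, are assumptions made for this example.

```python
def break_even_cost(request_times, storage_rate, recompute_cost):
    """Cost of a break-even online rule (illustrative, not the paper's algorithm).

    After each use the data is kept for at most recompute_cost / storage_rate
    time units. If the next request arrives within that window, only storage
    is paid; otherwise the full window is paid and the data is recomputed.
    """
    window = recompute_cost / storage_rate
    total = 0.0
    for prev, nxt in zip(request_times, request_times[1:]):
        gap = nxt - prev
        if gap <= window:
            total += storage_rate * gap      # data was still stored at reuse
        else:
            total += 2 * recompute_cost      # stored for the window, then recomputed
    return total


def offline_optimal_cost(request_times, storage_rate, recompute_cost):
    """Cost with full knowledge of the future: per gap, take the cheaper option."""
    total = 0.0
    for prev, nxt in zip(request_times, request_times[1:]):
        total += min(storage_rate * (nxt - prev), recompute_cost)
    return total


# Hypothetical request trace: three closely spaced uses, then a long idle gap.
times = [0, 1, 2, 10]
online = break_even_cost(times, storage_rate=1.0, recompute_cost=3.0)
offline = offline_optimal_cost(times, storage_rate=1.0, recompute_cost=3.0)
print(online, offline)
```

Under this simplified model the online rule never pays more than twice the offline optimum on any gap, which is the baseline guarantee that more refined online algorithms aim to improve on.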
