Abstract
As businesses increasingly rely on cloud-based big data analytics services to drive insights, reducing the cost of storing and analyzing large volumes of data in the cloud has become a major concern. During the execution of big data analysis jobs, some of the generated data can be reused by subsequent jobs. By storing such intermediate data instead of regenerating it, businesses using cloud services can greatly reduce the cost of running big data jobs. The key challenge is determining which data should be stored in order to save cost. Existing storage strategies do not differentiate between data with different usage frequencies, resulting in significant storage costs in practical applications. To address this challenge, we propose two online algorithms, one deterministic and one randomized, which dynamically decide whether to store the data with the aim of minimizing cost. We show that our proposed deterministic (resp., randomized) algorithm incurs a cost of at most 2 − α′ (resp., 2/(1 + α′)) times the minimum cost achieved by an optimal offline algorithm, which is assumed to know the exact future a priori. Finally, through extensive experiments with real-world workloads of big data jobs in an Alibaba Cloud environment, we demonstrate that our proposed online algorithms achieve significant cost savings under common cloud pricing schemes.