Abstract

Cloud computing enables a user to quickly provision a Hadoop cluster of any size, execute a given MapReduce workload, and then pay for the time the resources were used. Typically, the Cloud offers a choice of VM instance types (e.g., small, medium, or large EC2 instances), and the capacity differences among these VMs are reflected in their pricing. Therefore, for the same price a user can get a variety of Hadoop clusters based on different VM instance types. We observe that the performance of MapReduce applications may vary significantly across platforms. This makes selecting the best cost/performance platform for a given workload a non-trivial problem, especially when different jobs exhibit different platform preferences. In this work, we aim to solve the following problem: given a completion time target for a set of MapReduce jobs, determine a homogeneous or heterogeneous Hadoop cluster configuration (i.e., the number and types of VMs, and the job schedule) that processes these jobs within the given deadline while minimizing the cost of the rented infrastructure. We offer a simulation-based framework for solving this problem. Our evaluation study and experiments with the Amazon EC2 platform reveal that, for different workload mixes, an optimized platform choice may yield 41-67% cost savings for the same performance objectives compared with different (but seemingly equivalent) choices. Moreover, depending on the workload, the heterogeneous cluster solution may outperform the homogeneous one by 26-42%. The results of our simulation study are validated through experiments with Hadoop clusters deployed on Amazon EC2 instances.
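To make the problem statement concrete, the following is a minimal sketch of the homogeneous case only: a brute-force search over instance types and cluster sizes that keeps the cheapest configuration meeting the deadline. It is not the framework described above. The `toy_makespan_hours` function is a hypothetical stand-in for a simulator, and the instance names, prices, and speeds are illustrative assumptions, not real EC2 figures.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class InstanceType:
    name: str
    price_per_hour: float  # hypothetical price, not real EC2 pricing
    speed: float           # hypothetical relative processing speed


def toy_makespan_hours(vm: InstanceType, num_vms: int,
                       parallel_work: float = 120.0,
                       serial_hours: float = 1.0) -> float:
    """Hypothetical stand-in for a simulator: a fixed serial part plus
    perfectly parallel work divided among the VMs."""
    return serial_hours + parallel_work / (num_vms * vm.speed)


def cheapest_homogeneous_cluster(
    types: Sequence[InstanceType],
    deadline_hours: float,
    makespan: Callable[[InstanceType, int], float] = toy_makespan_hours,
    max_vms: int = 64,
) -> Optional[Tuple[str, int, float]]:
    """Return (type name, number of VMs, cost) for the cheapest cluster that
    meets the deadline, or None if no candidate does."""
    best: Optional[Tuple[str, int, float]] = None
    for vm in types:
        for n in range(1, max_vms + 1):
            hours = makespan(vm, n)
            if hours > deadline_hours:
                continue  # this configuration misses the deadline
            cost = n * vm.price_per_hour * hours  # pay for the time used
            if best is None or cost < best[2]:
                best = (vm.name, n, cost)
    return best


candidates = [
    InstanceType("small", price_per_hour=1.0, speed=1.0),
    InstanceType("large", price_per_hour=4.0, speed=3.0),
]
print(cheapest_homogeneous_cluster(candidates, deadline_hours=5.0))
```

In this toy model the "large" type is four times the price but only three times the speed, so the search prefers "small" instances. A heterogeneous solution would additionally need to partition the jobs across sub-clusters and choose a job schedule, which this sketch does not attempt.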
