Abstract

In 2011, Google released a 1-month production trace with hundreds of thousands of jobs running across more than 12,000 heterogeneous hosts. In-depth research based on the trace requires a close-to-practice simulation system. In this paper, we devise a distributed cloud simulator (or toolkit) based on virtual machines, with three important features. (1) The dynamically changing resource amounts (such as CPU rate and memory size) consumed by the reproduced jobs can be emulated as closely as possible to the real values in the trace. (2) Various types of events (e.g., kill/evict events) can be emulated precisely based on the trace. (3) The toolkit can emulate more complex and useful cases beyond the original trace, to adapt to various research demands. We evaluate the system on a real cluster environment with 16×8=128 cores and 112 virtual machines built with the Xen hypervisor. To the best of our knowledge, this is the first work to reproduce the Google cloud environment with a real experimental system setting and a real-world, large-scale production trace. Experiments show that our simulation system can effectively reproduce the real checkpointing/restart events in the Google trace by leveraging the Berkeley Lab Checkpoint/Restart (BLCR) tool. It can simultaneously process up to 1200 emulated Google jobs over the 112 virtual machines. The toolkit has been released as free software under the GNU GPL v3, and it has been successfully applied to fundamental research on the optimization of checkpoint intervals for Google tasks. Copyright © 2014. Published 2014. This article is a U.S. Government work and is in the public domain in the USA.

