Abstract
Cloud computing revolutionizes the way large amounts of data are processed and offers a compelling paradigm to organizations. An increasing number of data-intensive scientific applications are being ported to cloud environments such as virtualized clusters to take advantage of increased cost-efficiency, flexibility, scalability, improved hardware utilization and a reduced carbon footprint, among other benefits. However, because the application execution environment is complex, routine tasks such as monitoring, performance analysis and debugging of applications deployed on the cloud become cumbersome. These tasks often require close inspection of, and interaction with, multiple layers of the application and system software stack. For example, when analyzing a distributed application that has been provisioned on a cluster of virtual machines (VMs), a researcher might need to monitor the execution of their program on the VMs, or the availability of physical resources to the VMs. This would require the researcher to use a different set of tools to collect and analyze performance data from each level. Otus is a tool that enables resource attribution in clusters, but on virtualized clusters it currently reports only virtual resource utilization, not physical resource utilization. This is insufficient to fully understand application behavior on a cloud platform: it fails to account for, for example, the state of the physical infrastructure, its availability, or the variation in load caused by other VMs on the same physical host. We are extending Otus to collect metrics from multiple layers, starting with the hypervisor. Otus can now collect information from both the VM level and the hypervisor level; this information is stored in an OpenTSDB database, which scales to large clusters. A web-based application allows the researcher to selectively visualize these metrics in real time or for a particular time range in the past.
We have tested our multi-layered monitoring technique on several Hadoop MapReduce applications and identified the causes of several performance problems that would not have been apparent using existing methods. Apart from helping researchers understand application needs, our technique could also help cloud researchers accelerate the development and testing of new platforms.