Data location aware scheduling for virtual Hadoop cluster deployment on private cloud computing environment

Asmath Fahad Thaha, Nazrul Muhaimin Ahmad, Subarmaniam Kannan, Anang Hudaya Muhamad Amin

doi:10.1109/apcc.2016.7581422

Abstract

With advances in the Internet of Things (IoT) and Machine-to-Machine (M2M) communications, sensory devices in distributed environments inevitably generate massive amounts of streaming data. A common practice is to process these data on a high-performance computing infrastructure such as a cloud. A cloud platform can deploy the Hadoop ecosystem on virtual clusters. In a cloud configuration that spans different geographical regions, however, the virtual machines (VMs) that make up a virtual cluster are placed randomly, so the data must be transferred to the regional sites hosting the VMs before processing in order to achieve data locality. This paper proposes a provisioning strategy with data-location-aware deployment for virtual clusters, which localizes and provisions the cluster near the storage. The proposed mechanism reduces the network distance between the virtual cluster and the storage, resulting in reduced job completion times.
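
The placement idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' scheduler: the `Region` type, the `pick_region` helper, the hop-count distance table, and the capacity check are all hypothetical, introduced only to show how choosing the region closest to the storage differs from random placement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical description of one regional site in the private cloud."""
    name: str
    free_vm_slots: int


def pick_region(regions, distance_to_storage, vms_needed):
    """Return the region closest to the storage that can host the whole cluster.

    distance_to_storage maps a region name to an assumed network distance
    (for example, a hop count) between that region and the data store.
    """
    candidates = [r for r in regions if r.free_vm_slots >= vms_needed]
    if not candidates:
        raise ValueError("no region can host the whole cluster")
    return min(candidates, key=lambda r: distance_to_storage[r.name])


# Illustrative values only; none of these numbers come from the paper.
regions = [Region("region-a", 8), Region("region-b", 2), Region("region-c", 6)]
distance = {"region-a": 3, "region-b": 0, "region-c": 1}

chosen = pick_region(regions, distance, vms_needed=4)
print(chosen.name)  # region-c: region-b is nearer but lacks capacity
```

In this sketch the storage sits in `region-b`, which cannot host four VMs, so the cluster lands in the next-nearest region instead of a randomly chosen one.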
