Over the last decade, efficient analysis of data-intensive applications has become an important research issue. Hadoop is the most widely used platform for data-intensive applications. However, the majority of data placement strategies attempt to place related data close to each other for faster access without considering newly generated datasets or different MapReduce jobs. This paper deals with improving MapReduce performance over multi-cluster datasets by means of a novel entropy-based data placement strategy (EDPS) in three phases. First, a K-means clustering strategy is employed to extract dependencies among different datasets and group them into data-groups. Then these data-groups are placed in different datacenters while considering heterogeneity. Finally, an entropy-based grouping is applied to the newly generated datasets, where each dataset is grouped with the most similar existing cluster based on its relative entropy. The experimental results show the efficacy of the proposed three-fold dynamic grouping and data placement policy, which significantly reduces the time required and improves Hadoop performance.
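The third phase, grouping a newly generated dataset with the most similar existing cluster by relative entropy, can be illustrated with a minimal sketch. It assumes each dataset and each cluster is summarized as a vector of access counts per MapReduce job; that representation, the smoothing constant, and the names `normalize`, `relative_entropy`, and `assign_to_cluster` are hypothetical choices for illustration, not details taken from the paper.

```python
import math


def normalize(counts, eps=1e-9):
    """Turn non-negative counts into a probability distribution.

    A small eps is added to every entry (an assumption of this sketch)
    so that no probability is exactly zero and the logarithm is defined.
    """
    smoothed = [c + eps for c in counts]
    total = sum(smoothed)
    return [s / total for s in smoothed]


def relative_entropy(p, q):
    """Relative entropy (Kullback-Leibler divergence) D(p || q), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def assign_to_cluster(new_profile, cluster_profiles):
    """Return the cluster whose profile has the lowest relative entropy
    with respect to the new dataset, plus all computed divergences."""
    p = normalize(new_profile)
    divergences = {
        name: relative_entropy(p, normalize(profile))
        for name, profile in cluster_profiles.items()
    }
    best = min(divergences, key=divergences.get)
    return best, divergences


# Hypothetical access counts of three jobs for two existing data-groups.
clusters = {"group_a": [8, 1, 1], "group_b": [1, 1, 8]}
best, scores = assign_to_cluster([6, 2, 2], clusters)
print(best)  # group_a
```

A lower relative entropy means the access pattern of the new dataset is closer to that of the cluster, so the dataset is placed with that data-group. Relative entropy is asymmetric, so the direction D(new || cluster) used here is itself an assumption.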