Significance of Hierarchical and Markov Clustering in Grouping Aware Data Placement for Data Intensive Applications Having Interest Locality

Balasundaram Sadhu Ramakrishnan, Vengadeswaran Shanmugasundaram

doi:10.12694/scpe.v19i3.1375

Abstract

In this data era, massive volumes of data are generated every second in a variety of domains such as geoscience, the social web, finance, e-commerce, health care, climate modelling, physics, astronomy, and government sectors. Hadoop is well recognized as the de facto big data processing platform and has been extensively adopted in many application domains that process Big Data. Although it is considered an efficient solution for complex query processing, it has limitations when the data to be processed exhibit interest locality. The data required for query execution follow a grouping behavior, wherein only a part of the Big Data is accessed frequently. In such scenarios, the time taken to execute a query and return results increases exponentially as the amount of data increases, leading to long waiting times for the user. Since the Hadoop default data placement strategy (HDDPS) does not consider such grouping behavior, it performs inefficiently, resulting in shortcomings such as fewer local map task executions and increased query execution time. Hence, we propose an Optimal Data Placement Strategy (ODPS) based on grouping semantics. In this paper, we examine the significance of two of the most promising clustering techniques, Hierarchical Agglomerative Clustering (HAC) and Markov Clustering (MCL), in grouping-aware data placement for data-intensive applications having interest locality. Initially, the user access pattern is identified by dynamically analyzing the history log. Then both clustering techniques (HAC and MCL) are applied separately over the access pattern to obtain independent clusters. These clusters are interpreted and validated to extract the Optimal Data Groupings (ODG). Finally, the proposed strategy reorganizes the default data layouts in HDFS based on the ODG to achieve maximum parallel execution per group, subject to Load Balancer and Rack Awareness.
The proposed strategy is tested on a 10-node cluster, placed in a multi-rack setup and deployed on a cloud platform, with Hadoop installed on every node. The proposed strategy reduces query execution time, significantly improves data locality, and proves more efficient for processing massive datasets in a heterogeneous distributed environment. MCL also shows marginally better performance than HAC for queries exhibiting interest locality.
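The pipeline described in the abstract (access pattern, clustering, groupings, placement) can be illustrated with a small sketch. It is not the authors' implementation: the log format (one list of block IDs per query), the function names, the MCL parameters, and the round-robin placement rule are all assumptions made for illustration. It uses a plain-Python Markov Clustering loop (expansion followed by inflation) and ignores rack awareness and the HDFS APIs entirely.

```python
from itertools import combinations


def co_access_matrix(log, blocks):
    """Count how often each pair of blocks is read by the same query."""
    index = {b: i for i, b in enumerate(blocks)}
    n = len(blocks)
    m = [[0.0] * n for _ in range(n)]
    for accessed in log:
        for a, b in combinations(sorted(set(accessed)), 2):
            m[index[a]][index[b]] += 1.0
            m[index[b]][index[a]] += 1.0
    return m


def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        total = sum(out[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                out[i][j] /= total
    return out


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(m, inflation=2.0, iterations=50, tol=1e-9):
    """Basic MCL: alternate expansion (M*M) and inflation until stable."""
    n = len(m)
    work = [row[:] for row in m]
    for i in range(n):
        work[i][i] += 1.0  # self-loops, a common MCL preprocessing step
    work = normalize_columns(work)
    for _ in range(iterations):
        expanded = matmul(work, work)
        inflated = normalize_columns(
            [[v ** inflation for v in row] for row in expanded])
        delta = max(abs(inflated[i][j] - work[i][j])
                    for i in range(n) for j in range(n))
        work = inflated
        if delta < tol:
            break
    # Rows that keep non-zero mass belong to attractors; the columns they
    # cover form one cluster. Identical member sets are reported once.
    clusters = []
    for i in range(n):
        members = frozenset(j for j in range(n) if work[i][j] > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return clusters


def group_blocks(log, blocks):
    """Hypothetical helper: history log -> sorted groups of block IDs."""
    clusters = markov_cluster(co_access_matrix(log, blocks))
    return sorted(sorted(blocks[j] for j in c) for c in clusters)


def place_groups(groups, nodes):
    """Spread each group's blocks over different nodes, least loaded first."""
    load = {node: 0 for node in nodes}
    placement = {}
    for group in groups:
        ordered = sorted(nodes, key=lambda node: (load[node], node))
        for i, block in enumerate(group):
            node = ordered[i % len(ordered)]
            placement[block] = node
            load[node] += 1
    return placement


if __name__ == "__main__":
    blocks = ["b1", "b2", "b3", "b4", "b5", "b6"]
    log = [["b1", "b2", "b3"], ["b1", "b2"], ["b2", "b3"],
           ["b4", "b5"], ["b5", "b6"], ["b4", "b5", "b6"]]
    groups = group_blocks(log, blocks)
    print(groups)
    print(place_groups(groups, ["n1", "n2", "n3"]))
```

Placing the blocks of one group on different nodes is what lets the map tasks of a query that touches the whole group run in parallel and locally. A hierarchical agglomerative clustering step could be substituted for `markov_cluster` over the same co-access matrix to compare the two techniques.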


More From: Scalable Computing: Practice and Experience


Similar Papers

An Optimal Data Placement Strategy for Improving System Performance of Massive Data Applications Using Graph Clustering
S Vengadeswaran ... S R Balasundaram
International Journal of Ambient Computing and Intelligence | VOL. 9
01 Jul 2018

Significance of hierarchical and partitioning based clustering in grouping aware data placement for data intensive applications
S Vengadeswaran ... S R Balasundaram
01 Feb 2017

Grouping-Aware Data Placement in HDFS for Data-Intensive Applications Based on Graph Clustering
S Vengadeswaran ... S R Balasundaram
29 Sep 2017

Special Issue on Infrastructures and Algorithms for Scalable Computing
Sasko Ristov
Scalable Computing: Practice and Experience | VOL. 19
17 Sep 2018
