Handling data-skewness in character based string similarity join using Hadoop

Kanak Meena, Devendra K Tayal, Oscar Castillo, Amita Jain

doi:10.1016/j.aci.2018.11.001

Abstract

Data skewness threatens the scalability of similarity joins and is a pervasive problem in scientific data. Skewness distributes attribute values unevenly, which can cause severe load imbalance. When database join operations are applied to such datasets, skewness grows exponentially. All the algorithms developed to date for implementing database joins are highly skew sensitive. This paper presents a new approach for handling data skewness in character-based string similarity joins using the MapReduce framework. In the literature, no such work exists for character-based string similarity joins, although work on set-based string similarity joins does exist. The proposed work is divided into three stages, and every stage is further divided into mapper and reducer phases, each dedicated to a specific task. The first stage finds the lengths of the strings in a dataset. The second stage uses the MR-Pass Join framework to generate valid candidate pairs. The third stage incorporates MRFA concepts into the string similarity join, giving “MRFA-SSJ” (MapReduce Frequency Adaptive – String Similarity Join), and is further divided into four MapReduce phases. MRFA-SSJ is thus proposed to handle skewness in the string similarity join. The experiments were run with the Hadoop framework on a 15-node cluster using three datasets: DBLP, Query log, and a real dataset of IP addresses and cookies. The Zipf distribution law was used to analyze the skewness factor. The proposed algorithm was compared with three known algorithms. The existing techniques fail when data is highly skewed, surviving only up to Zipf factor 0.5, whereas the proposed algorithm survives up to Zipf factor 1. Hence the proposed algorithm is skew insensitive and ensures scalability with a reasonable query processing time for string similarity database joins. It also ensures an even distribution of attributes.
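To give a feel for why a larger Zipf factor hurts a partitioned join, the following minimal sketch computes the share of records each reducer would receive when join keys follow a Zipf distribution. It is an illustration only, not the paper's MRFA-SSJ method: the function names, the modulo partitioner, and the choice of 1500 distinct keys are assumptions made for the example, and the 15 reducers merely echo the 15-node cluster mentioned above.

```python
def zipf_weights(num_keys, s):
    """Relative frequency of each key rank under a Zipf law with factor s."""
    raw = [1.0 / (rank ** s) for rank in range(1, num_keys + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def reducer_loads(num_keys, s, num_reducers=15):
    """Fraction of all records sent to each reducer (hypothetical modulo partitioner)."""
    loads = [0.0] * num_reducers
    for key, share in enumerate(zipf_weights(num_keys, s)):
        loads[key % num_reducers] += share
    return loads


def imbalance(loads):
    """Ratio of the busiest reducer's load to the mean load (1.0 means perfectly even)."""
    return max(loads) / (sum(loads) / len(loads))


if __name__ == "__main__":
    for s in (0.0, 0.5, 1.0):
        print(f"Zipf factor {s}: imbalance {imbalance(reducer_loads(1500, s)):.2f}")
```

With a Zipf factor of 0 every key is equally frequent and the load is even; as the factor approaches 1, the reducer holding the most frequent key receives a growing share of the work, which is the load imbalance the abstract refers to.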

Highlights

The term “Big Data” [1,2,3,4,5,6,7,8,9,10] has turned into a buzzword and is broadly used in both the research and industrial worlds.
Data integration [3,11] can be achieved by using the string similarity join [2,3], which returns the similar pairs of strings from two given collections of strings.
This paper presents a hybrid approach that combines Pass Join and the MapReduce framework with the concept of MapReduce Frequency Adaptive (MRFA) to handle the basic record join.
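The partition idea usually associated with Pass Join can be sketched as follows: a string is split into tau + 1 segments, and by the pigeonhole principle any string within edit distance tau of it must contain at least one of those segments unchanged. This is a simplified, single-machine illustration under that assumption. The names `partition` and `is_candidate` and the threshold parameter `tau` are hypothetical, the position restrictions a full implementation would use to prune substrings are omitted, and nothing here reproduces the paper's MR-Pass Join stage.

```python
def partition(s, tau):
    """Split s into tau + 1 contiguous segments of near-equal length."""
    n_seg = tau + 1
    base, extra = divmod(len(s), n_seg)
    segments, start = [], 0
    for i in range(n_seg):
        # The last `extra` segments take one additional character each.
        length = base + (1 if i >= n_seg - extra else 0)
        segments.append(s[start:start + length])
        start += length
    return segments


def is_candidate(r, s, tau):
    """Filter step: True if (r, s) may be within edit distance tau.

    A True result still needs verification with a real edit-distance check.
    Strings shorter than tau + 1 produce empty segments, so the filter
    accepts them trivially (it never rejects a true match).
    """
    if abs(len(r) - len(s)) > tau:
        return False  # length filter: too many insertions or deletions needed
    return any(seg in s for seg in partition(r, tau))


if __name__ == "__main__":
    print(partition("vankatesh", 3))
    print(is_candidate("vankatesh", "avataresha", 3))
```

The filter only produces candidate pairs; pairs that pass it are verified afterwards, so the design trades a cheap substring test for fewer expensive edit-distance computations.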

Summary

Introduction

The term “Big Data” [1,2,3,4,5,6,7,8,9,10] has turned into a buzzword and is broadly used in both the research and industrial worlds. Big Data is commonly described as a blend of the 4 V’s, namely Volume, Variety, Veracity, and Velocity. It is closely related to data integration, which aims at combining the various forms of data from different sources to provide a consolidated view. Data integration [3,11] can be achieved by using the string similarity join [2,3], which returns the similar pairs of strings from two given collections of strings. The similarity of two strings is calculated with a similarity function. Two types of similarity functions are used: character-based similarity functions [12,13,14,15,16,17,18,19,20,21,22,23] and set-based similarity functions [14,15,16,17,18].
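The difference between the two families can be illustrated with one common representative of each. The text above does not name specific functions, so the choice of edit (Levenshtein) distance for the character-based family and Jaccard similarity over whitespace tokens for the set-based family is an assumption made for this sketch, as are the function names.

```python
def edit_distance(a, b):
    """Character-based: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Set-based: overlap of the two token sets divided by their union."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(jaccard("big data join", "data join hadoop"))
```

A character-based function reacts to individual typos inside a word, while a set-based function ignores token order and only compares which tokens are present.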

Character-Based Similarity Functions

Set-Based Similarity Functions

Material and methods

Stage 1

Conclusion


Journal: Applied Computing and Informatics
Publication Date: Nov 16, 2018
Citations: 3
License type: cc-by


