Context-Based Geodesic Dissimilarity Measure for Clustering Categorical Data

Changki Lee, Uk Jung

doi:10.3390/app11188416

Abstract

Measuring the dissimilarity between two observations is the basis of many data mining and machine learning algorithms, and its effectiveness has a significant impact on learning outcomes. Computing a dissimilarity or distance is a manageable problem for continuous data because many numerical operations can be applied directly. For observations with categorical variables, however, defining a dissimilarity between pairs of observations is not straightforward. This study proposes a new method to measure the dissimilarity between two categorical observations, called a context-based geodesic dissimilarity measure, for the categorical data clustering problem. The proposed method considers the relationships between categorical variables and discovers the implicit topological structures in categorical data. In other words, it can effectively reflect the nonlinear patterns of arbitrarily shaped categorical data clusters. Our experimental results confirm that the proposed measure, which considers both nonlinear data patterns and relationships among the categorical variables, yields better clustering performance than other distance measures.

Highlights

The measurement of the distance or dissimilarity between two data observations plays an important role in clustering.
We propose the context-based geodesic dissimilarity (CGD) measure, which is useful for clustering categorical data that exhibit (1) correlations and (2) manifold structures.
We conducted experiments to study the characteristics of the proposed method (CGD) and compared it with conventional categorical distance measures from the literature: Gower distance (GD) [5], association-based dissimilarity (AD) [8], and a variant of the geodesic distance that uses Gower distance (hereafter, Gower-based geodesic distance (GGD)).
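
To make the Gower-based geodesic distance (GGD) baseline more concrete, here is a minimal Python sketch. It assumes purely categorical data, for which Gower distance reduces to the share of mismatching attributes, and it assumes an Isomap-style construction: a symmetric k-nearest-neighbor graph whose shortest paths are found with Floyd-Warshall. The function names, the `k` parameter, and the toy rows are illustrative and not taken from the paper, whose exact graph construction may differ.

```python
def gower_categorical(a, b):
    """Gower distance for purely categorical rows: share of mismatching attributes."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def geodesic_matrix(rows, k=2):
    """Shortest-path distances over a symmetric k-NN graph (assumed construction)."""
    n = len(rows)
    inf = float("inf")
    base = [[gower_categorical(rows[i], rows[j]) for j in range(n)] for i in range(n)]

    # Start with no edges, then connect each row to its k nearest neighbors.
    graph = [[0.0 if i == j else inf for j in range(n)] for i in range(n)]
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i), key=lambda j: base[i][j])[:k]
        for j in nearest:
            graph[i][j] = graph[j][i] = base[i][j]

    # Floyd-Warshall: relax every pair through every intermediate node.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                via = graph[i][m] + graph[m][j]
                if via < graph[i][j]:
                    graph[i][j] = via
    return graph


# Toy rows that form a chain: each row differs from the next in one attribute.
rows = [
    ("a", "x", "p"),
    ("b", "x", "p"),
    ("b", "y", "p"),
    ("b", "y", "q"),
    ("a", "y", "q"),
]
geo = geodesic_matrix(rows, k=1)
print(gower_categorical(rows[0], rows[4]))  # direct Gower distance between the chain ends
print(geo[0][4])                            # geodesic distance along the chain
```

In this toy chain the two end rows differ in two of three attributes, so their direct Gower distance is 2/3, while the path through the intermediate rows has length 4/3. Pairs that end up in disconnected components of the graph keep an infinite distance in this sketch.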

Summary

Introduction

The measurement of the distance or dissimilarity between two data observations plays an important role in clustering. Various distance measures have been proposed for continuous data. K-means clustering, one of the simplest classical methods, uses the Euclidean distance. However, the Euclidean distance does not apply when the dataset is composed of categorical variables. The business intelligence community is overwhelmed with large collections of categorical data, such as those collected from banks, the health sector, web logs, and biological sequences [2]. Banking and health sector data primarily contain categorical variables such as sex, smoking status, and marital status. Clustering categorical data into meaningful groups is a challenging problem because it is difficult to define distance measures that effectively reflect the data characteristics.
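
The problem with applying the Euclidean distance to categorical variables can be illustrated with a small Python sketch. It assumes a single marital-status variable with three hypothetical labels and two arbitrary integer codings (`coding_a` and `coding_b`, both invented for illustration). The Euclidean distance between the same two observations changes with the coding, whereas a simple matching dissimilarity does not.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def simple_matching(a, b):
    """Share of attributes on which two categorical rows disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def encode(row, coding):
    """Map each category label to the integer assigned by the coding."""
    return [coding[value] for value in row]


# Two equally arbitrary integer codings of the same labels (hypothetical).
coding_a = {"single": 0, "married": 1, "divorced": 2}
coding_b = {"single": 0, "married": 2, "divorced": 1}

p, q = ("single",), ("married",)
print(euclidean(encode(p, coding_a), encode(q, coding_a)))  # 1.0
print(euclidean(encode(p, coding_b), encode(q, coding_b)))  # 2.0
print(simple_matching(p, q))                                # 1.0
```

Because the integer codes carry no order or magnitude, the Euclidean result depends on a labeling choice rather than on the data, which is one reason dedicated categorical dissimilarity measures are needed.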


Journal: Applied Sciences
Publication Date: Sep 10, 2021
Citations: 3
License type: CC BY 4.0
