A Similarity Measurement with Entropy-Based Weighting for Clustering Mixed Numerical and Categorical Datasets

Xia Que, Ning An, Siyuan Jiang, Jiaoyun Yang

doi:10.3390/a14060184

Abstract

Many mixed datasets with both numerical and categorical attributes have been collected in fields such as medicine and biology. Designing an appropriate similarity measurement plays an important role in clustering these datasets. Many traditional measurements treat all attributes equally when measuring similarity. However, different attributes may contribute differently, because the amount of information they contain can vary considerably. In this paper, we propose a similarity measurement with entropy-based weighting for clustering mixed datasets. The numerical data are first transformed into categorical data by an automatic categorization technique. Then, an entropy-based weighting strategy is applied to reflect the differing importance of the attributes. We incorporate the proposed measurement into an iterative clustering algorithm, and extensive experiments show that this algorithm outperforms the OCIL and K-Prototype methods with 2.13% and 4.28% improvements, respectively, in terms of accuracy on six mixed datasets from UCI.
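The two steps named in the abstract (categorizing numerical attributes, then weighting attributes by entropy) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' formulation: equal-width binning stands in for the paper's automatic categorization technique, weights are taken to be proportional to each attribute's Shannon entropy, and similarity is a weighted count of matching attribute values. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def equal_width_bins(values, n_bins=3):
    """Map numeric values to bin indices 0..n_bins-1 (stand-in categorization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def shannon_entropy(column):
    """Shannon entropy (in bits) of the value distribution of one attribute."""
    n = len(column)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(column).values())


def entropy_weights(columns):
    """Assumed weighting: each attribute's share of the total entropy."""
    entropies = [shannon_entropy(col) for col in columns]
    total = sum(entropies)
    if total == 0:
        # No attribute carries information; fall back to equal weights.
        return [1.0 / len(columns)] * len(columns)
    return [h / total for h in entropies]


def weighted_similarity(a, b, weights):
    """Sum of the weights of the attributes on which objects a and b agree."""
    return sum(w for x, y, w in zip(a, b, weights) if x == y)


# Toy mixed dataset (hypothetical): one numeric and two categorical attributes.
ages = [20, 30, 40, 50, 60, 70]
sex = ["m", "f", "m", "f", "m", "f"]
site = ["x"] * 6  # constant attribute, carries no information

columns = [equal_width_bins(ages), sex, site]
weights = entropy_weights(columns)
rows = list(zip(*columns))

print(weights)
print(weighted_similarity(rows[0], rows[1], weights))
```

In this sketch the constant attribute receives zero weight, so agreeing on it adds nothing to the similarity of two objects. Whether the paper rewards high-entropy or low-entropy attributes, and how it normalizes the weights, is not stated in the abstract.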

Highlights

The main purposes of clustering analyses are to discover the implicit class structure in the data and to divide physical or abstract objects into different classes, where the similarity between a pair of objects is large if they are in the same class and small if they are in different classes.
To test the effectiveness of the similarity measurement with entropy-based weighting proposed in this paper, two types of datasets, mixed and numerical, were selected from the UCI Machine Learning Data Repository [26]; most of the datasets were collected from the fields of biology and medicine.
The iterative clustering algorithm based on the proposed similarity measurement was compared with existing clustering algorithms, including OCIL [12], K-Prototype [9] and k-means [4]. k-means was used only for datasets made up entirely of numerical variables.

Summary

Introduction

The main purposes of clustering analyses are to discover the implicit class structure in the data and to divide physical or abstract objects into different classes, where the similarity between a pair of objects is large if they are in the same class and small if they are in different classes. As a major exploratory data analysis tool, clustering analysis has been widely researched and applied in many fields, such as sociology, biology and medicine. Most current methods are designed for a single type of dataset (numerical or categorical). Classical clustering methods, such as the k-means algorithm [4,5] and the EM algorithm [6], are limited to numerical datasets, while other algorithms have been proposed for clustering categorical datasets [7,8]. In medicine and biology, many datasets are collected with both numerical and categorical attributes. Many researchers have therefore worked on clustering algorithms for mixed datasets with categorical and numerical attributes [9,10].



Journal: Algorithms
Publication Date: Jun 15, 2021
Citations: 4
License type: CC BY 4.0


Similar Papers

A Unified Metric for Categorical and Numerical Attributes in Data Clustering
Yiu-Ming Cheung ... Hong Jia
01 Jan 2013

Categorical-and-numerical-attribute data clustering based on a unified similarity metric without knowing cluster number
Yiu-Ming Cheung ... Hong Jia
Pattern Recognition | VOL. 46
31 Jan 2013

Determining the number of clusters using information entropy for mixed data
Jiye Liang ... Deyu Li
Pattern Recognition | VOL. 45
24 Dec 2011

An Affinity Propagation Clustering Algorithm for Mixed Numeric and Categorical Datasets
Kang Zhang ... Xingsheng Gu
Mathematical Problems in Engineering | VOL. 2014
01 Jan 2014
