MCF Tree-Based Clustering Method for Very Large Mixed-Type Data Set

Hyeong-Cheol Ryu, Sungwon Jung

doi:10.1109/access.2021.3118411

Abstract

Several clustering methods have been proposed for analyzing numerous mixed-type data sets composed of numeric and categorical attributes. However, existing clustering methods are not suitable for clustering very large mixed-type data sets because they require a high computational cost or a large memory size. We propose a novel clustering method for very large data sets using a mixed-type clustering feature (MCF) vector with summary information about a cluster. The MCF vector consists of the CF vector and a histogram to summarize the mixed-type values. Based on the MCF vector, we propose an MCF tree, along with a distance measure between the MCF vectors representing two clusters. Unlike previous studies that summarize a data set based on a fixed memory size, we estimate a small initial memory size of the data set for building the tree. Then, the memory size is adaptively increased to estimate a more accurate threshold by reflecting the size reduction in the re-built tree. Our theoretical analysis demonstrates the efficiency of the proposed approach. Experimental results on very large synthetic and real data sets demonstrate that the proposed approach clusters the data sets significantly faster than existing clustering methods while maintaining similar or better clustering accuracy.

Highlights

Clustering groups a data set into clusters based on the concepts of distance or similarity
We propose a scheme for summarizing mixed-type data sets for clustering using a mixed-type clustering feature (MCF) vector composed of the CF vector and a histogram
We present a distance measure based on the histogram of the MCF vector
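
As a rough illustration of the summary structure named in the highlights, the sketch below pairs a BIRCH-style CF triple (count, linear sum, sum of squares) with one value-count histogram per categorical attribute. The class name `MCF`, its fields, and the `add`, `merge`, and `centroid` methods are hypothetical names chosen for this example; the paper's exact definition and its histogram-based distance measure are not reproduced in this summary, so this should be read as an assumption-laden sketch rather than the authors' implementation.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class MCF:
    """Hypothetical mixed-type clustering feature (assumed layout, not the paper's)."""
    n: int = 0                                          # number of points summarized
    ls: List[float] = field(default_factory=list)       # linear sum per numeric attribute
    ss: float = 0.0                                     # sum of squared numeric values
    hist: List[Counter] = field(default_factory=list)   # one histogram per categorical attribute

    def add(self, numeric: Sequence[float], categorical: Sequence[str]) -> None:
        """Absorb one mixed-type point into the summary."""
        if self.n == 0:
            self.ls = [0.0] * len(numeric)
            self.hist = [Counter() for _ in categorical]
        self.n += 1
        for i, x in enumerate(numeric):
            self.ls[i] += x
            self.ss += x * x
        for h, value in zip(self.hist, categorical):
            h[value] += 1

    def merge(self, other: "MCF") -> "MCF":
        """Combine two summaries; every component is additive."""
        if self.n == 0:
            return other
        if other.n == 0:
            return self
        return MCF(
            n=self.n + other.n,
            ls=[a + b for a, b in zip(self.ls, other.ls)],
            ss=self.ss + other.ss,
            hist=[ha + hb for ha, hb in zip(self.hist, other.hist)],
        )

    def centroid(self) -> List[float]:
        """Mean of the numeric attributes, recovered from the summary alone."""
        return [s / self.n for s in self.ls]


# Usage: summarize two small clusters, then merge them without revisiting the points.
a = MCF()
a.add([1.0, 2.0], ["red"])
a.add([3.0, 4.0], ["blue"])
b = MCF()
b.add([5.0, 6.0], ["red"])
merged = a.merge(b)
print(merged.n, merged.centroid(), dict(merged.hist[0]))
```

Because every field is a sum or a count, two summaries can be merged in time independent of the number of points they represent, which is the property that makes CF-style trees practical for very large data sets.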

Summary

INTRODUCTION

Clustering groups a data set into clusters based on the concepts of distance or similarity. A summary-based clustering method for mixed-type data sets, which extends the CF tree of BIRCH, was proposed in [13]. If the given memory size is not suitable for clustering a mixed-type data set, the threshold estimation method of [13] does not perform well and yields poor final clusters. We propose a novel summary-based clustering method for very large mixed-type data sets that incorporates a dynamically increasing memory size for more accurate threshold estimation. Unlike previous works [13], [35], [38], [39], our method first estimates a small initial memory size for building the MCF tree by considering the size and characteristics of the given mixed-type data set.

RELATED WORKS

MEASURES FOR CLUSTERS

We compute the distance between two clusters as follows

CATEGORICAL DISTANCE MEASURE BASED ON HISTOGRAM

PERFORMANCE ANALYSIS

Summary scheme

Findings

CONCLUSIONS

Journal: IEEE Access	Publication Date: Jan 1, 2021
License type: CC BY 4.0

Similar Papers

An Effective Clustering Method over CF+ Tree Using Multiple Range Queries
Hyeong-Cheol Ryu ... Sakti Pramanik
IEEE Transactions on Knowledge and Data Engineering | VOL. 32
01 Jan 2020

A clustering method for very large mixed data sets
G Sanchez-Diaz ... J Ruiz-Shulcloper
29 Nov 2001

Protein Identification False Discovery Rates for Very Large Proteomics Data Sets Generated by Tandem Mass Spectrometry
Lukas Reiter ... Ruedi Aebersold
Molecular & Cellular Proteomics | VOL. 8
01 Nov 2009

Mapreduce-Based Distributed Clustering Method Using CF+ Tree
Hyeong-Cheol Ryu ... Sungwon Jung
IEEE Access | VOL. 8
01 Jan 2020
