Abstract

Several clustering methods have been proposed for analyzing numerous mixed-type data sets composed of numeric and categorical attributes. However, existing clustering methods are not suitable for clustering very large mixed-type data sets because they require a high computational cost or a large memory size. We propose a novel clustering method for very large data sets using a mixed-type clustering feature (MCF) vector with summary information about a cluster. The MCF vector consists of the CF vector and a histogram to summarize the mixed-type values. Based on the MCF vector, we propose an MCF tree, along with a distance measure between the MCF vectors representing two clusters. Unlike previous studies that summarize a data set based on a fixed memory size, we estimate a small initial memory size of the data set for building the tree. Then, the memory size is adaptively increased to estimate a more accurate threshold by reflecting the size reduction in the re-built tree. Our theoretical analysis demonstrates the efficiency of the proposed approach. Experimental results on very large synthetic and real data sets demonstrate that the proposed approach clusters the data sets significantly faster than existing clustering methods while maintaining similar or better clustering accuracy.

Highlights

  • Clustering groups a data set into clusters based on the concepts of distance or similarity

  • We propose a scheme for summarizing mixed-type data sets for clustering using a mixed-type clustering feature (MCF) vector composed of the CF vector and a histogram

  • We present a distance measure based on the histogram of the MCF vector


Summary

INTRODUCTION

Clustering groups a data set into clusters based on the concepts of distance or similarity. A summary-based clustering method for mixed-type data sets, which extends the CF tree of BIRCH, was proposed in [13]. However, if the given memory size is not suitable for the mixed-type data set being clustered, the threshold estimation method of [13] does not perform well and yields poor final clusters. We propose a novel summary-based clustering method for very large mixed-type data sets that dynamically increases the memory size to estimate the threshold more accurately. Unlike previous works [13], [35], [38], [39], our method first estimates a small initial memory size for building the MCF tree by considering the size and characteristics of the given mixed-type data set.
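
To make the summary structure concrete, the following is a minimal Python sketch of an MCF-style vector, not the authors' implementation. It assumes the numeric part is the standard BIRCH CF triple (count, linear sum, sum of squares) and that the histogram keeps one value-count table per categorical attribute. The class name `MCF`, its fields, and the function `illustrative_distance` are hypothetical. In particular, the distance shown here (centroid Euclidean distance plus total variation distance between histograms) is only a stand-in, since the paper's own measure is not reproduced in this summary.

```python
from collections import Counter
from dataclasses import dataclass, field
import math


@dataclass
class MCF:
    """Hypothetical mixed-type clustering feature: BIRCH CF triple + histograms."""
    n: int = 0                                 # number of points summarized
    ls: list = field(default_factory=list)     # per-attribute linear sum (numeric)
    ss: float = 0.0                            # sum of squared norms (numeric)
    hist: list = field(default_factory=list)   # one Counter per categorical attribute

    @classmethod
    def from_point(cls, numeric, categorical):
        # A single point is a cluster of size one.
        return cls(
            n=1,
            ls=list(numeric),
            ss=sum(x * x for x in numeric),
            hist=[Counter([v]) for v in categorical],
        )

    def merge(self, other):
        # Every component is additive, so two summaries combine
        # without revisiting the original points.
        return MCF(
            n=self.n + other.n,
            ls=[a + b for a, b in zip(self.ls, other.ls)],
            ss=self.ss + other.ss,
            hist=[a + b for a, b in zip(self.hist, other.hist)],
        )

    def centroid(self):
        return [x / self.n for x in self.ls]


def illustrative_distance(a, b):
    """Stand-in distance (an assumption, not the paper's measure)."""
    # Numeric part: Euclidean distance between centroids.
    numeric = math.dist(a.centroid(), b.centroid())
    # Categorical part: total variation distance between the relative
    # frequency histograms, summed over categorical attributes.
    categorical = 0.0
    for ha, hb in zip(a.hist, b.hist):
        values = set(ha) | set(hb)
        categorical += 0.5 * sum(abs(ha[v] / a.n - hb[v] / b.n) for v in values)
    return numeric + categorical


if __name__ == "__main__":
    p = MCF.from_point([1.0, 2.0], ["red"])
    q = MCF.from_point([3.0, 4.0], ["blue"])
    merged = p.merge(q)
    print(merged.n, merged.centroid(), dict(merged.hist[0]))
    print(illustrative_distance(p, q))
```

Because each component is additive, a tree node can absorb a point or another node in time proportional to the number of attributes and distinct categorical values, which is the property that lets a CF-style tree summarize a data set in a single pass.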

RELATED WORKS
MEASURES FOR CLUSTERS
CATEGORICAL DISTANCE MEASURE BASED ON HISTOGRAM
PERFORMANCE ANALYSIS
Summary scheme
Findings
CONCLUSIONS