Abstract
Several clustering methods have been proposed for analyzing numerous mixed-type data sets composed of numeric and categorical attributes. However, existing clustering methods are not suitable for clustering very large mixed-type data sets because they require a high computational cost or a large memory size. We propose a novel clustering method for very large data sets using a mixed-type clustering feature (MCF) vector with summary information about a cluster. The MCF vector consists of the CF vector and a histogram to summarize the mixed-type values. Based on the MCF vector, we propose an MCF tree, along with a distance measure between the MCF vectors representing two clusters. Unlike previous studies that summarize a data set based on a fixed memory size, we estimate a small initial memory size of the data set for building the tree. Then, the memory size is adaptively increased to estimate a more accurate threshold by reflecting the size reduction in the re-built tree. Our theoretical analysis demonstrates the efficiency of the proposed approach. Experimental results on very large synthetic and real data sets demonstrate that the proposed approach clusters the data sets significantly faster than existing clustering methods while maintaining similar or better clustering accuracy.
Highlights
Clustering groups a data set into clusters based on the concepts of distance or similarity
We propose a scheme for summarizing mixed-type data sets for clustering using a mixed-type clustering feature (MCF) vector composed of the CF vector and a histogram
We present a distance measure based on the histogram of the MCF vector
Summary
Clustering groups a data set into clusters based on the concepts of distance or similarity. A summary-based clustering method for mixed-type data sets, which extends the CF tree of BIRCH, was proposed in [13]. If the given memory size is not suitable for clustering a mixed-type data set, the threshold estimation method of [13] does not perform well and yields poor final clusters. We propose a novel summary-based clustering method for very large mixed-type data sets that dynamically increases the memory size to estimate the threshold more accurately. Unlike previous works [13], [35], [38], [39], our method first estimates a small initial memory size for building the MCF tree by considering the size and characteristics of the given mixed-type data set.
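To make the idea of an MCF vector concrete, the following is a minimal sketch, not the authors' implementation. It assumes the numeric part is the standard BIRCH clustering feature (count, linear sum, squared sum) and that the histogram part keeps one frequency count per categorical attribute. The class name `MCF`, its fields, and the function `illustrative_distance` are hypothetical; in particular, the distance shown (centroid Euclidean distance plus mean total-variation distance between normalized histograms) is an assumed stand-in, since the text above does not define the paper's measure.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class MCF:
    """Hypothetical mixed-type clustering feature: BIRCH-style CF plus histograms."""
    n: int       # number of points summarized
    ls: list     # per-attribute linear sum of the numeric values
    ss: float    # sum of squared numeric values over all attributes
    hist: list   # one Counter of category frequencies per categorical attribute

    @classmethod
    def from_point(cls, numeric, categorical):
        # Summarize a single data point.
        return cls(
            n=1,
            ls=list(numeric),
            ss=sum(x * x for x in numeric),
            hist=[Counter([c]) for c in categorical],
        )

    def merge(self, other):
        # Both parts are additive, so merging two summaries is a component-wise sum.
        return MCF(
            n=self.n + other.n,
            ls=[a + b for a, b in zip(self.ls, other.ls)],
            ss=self.ss + other.ss,
            hist=[a + b for a, b in zip(self.hist, other.hist)],
        )

    def centroid(self):
        return [s / self.n for s in self.ls]


def illustrative_distance(a, b):
    """Assumed measure (not the paper's): numeric centroid distance plus
    the mean total-variation distance between normalized histograms."""
    numeric = sum((x - y) ** 2 for x, y in zip(a.centroid(), b.centroid())) ** 0.5
    if not a.hist:
        return numeric
    cat = 0.0
    for ha, hb in zip(a.hist, b.hist):
        keys = set(ha) | set(hb)
        cat += 0.5 * sum(abs(ha[k] / a.n - hb[k] / b.n) for k in keys)
    return numeric + cat / len(a.hist)


p = MCF.from_point([1.0, 2.0], ["red"])
q = MCF.from_point([3.0, 4.0], ["blue"])
m = p.merge(q)
```

Because every component is a sum or a count, a summary of this shape can absorb new points or whole sub-clusters without revisiting the raw data, which is the property that lets a CF-tree-style structure run in limited memory.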