Abstract

Determining the correct number of clusters (CNC) is an important task in data clustering and has a critical effect on finalizing the partitioning results. K-means is one of the most popular clustering methods and requires the CNC as an input. Validity index methods use an additional optimization procedure to estimate the CNC for K-means. We propose an alternative validity index approach, denoted K-MACE (K-Minimizing Average Central Error). The Average Central Error (ACE) is the average error between the unavailable true cluster center and the estimated cluster center for each data sample. Kernel K-MACE is kernel K-means equipped with the proposed CNC estimator; in addition, it includes a procedure that automatically tunes the Gaussian kernel parameters. Simulation results for both synthetic and real data show the superiority of K-MACE and kernel K-MACE over conventional clustering methods, not only in CNC estimation but also in the partitioning procedure.
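
As a concrete illustration of the ACE quantity defined above, the following minimal Python sketch computes it on synthetic data, where the true cluster centers are known (in real data they are unavailable, which is precisely what K-MACE must estimate around). The helper name `average_central_error` and all settings here are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the ACE idea: average error between each sample's (normally
# unavailable) true cluster center and the estimated center of the cluster
# that sample is assigned to. Computable here only because the data is synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
labels_true = rng.integers(0, 3, size=300)
X = true_centers[labels_true] + rng.normal(scale=0.8, size=(300, 2))

def average_central_error(X, labels_true, true_centers, k):
    """Mean squared error between the true center and the estimated center
    assigned to each sample, for a k-cluster K-means partition."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    est_center_per_sample = km.cluster_centers_[km.labels_]
    true_center_per_sample = true_centers[labels_true]
    return np.mean(np.sum((est_center_per_sample - true_center_per_sample) ** 2, axis=1))

for k in range(2, 7):
    print(k, average_central_error(X, labels_true, true_centers, k))
# The error is smallest near the correct number of clusters (k = 3 here).
```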

Highlights

  • Clustering is one of the most widely used unsupervised learning tasks, in which unlabeled observed data samples are grouped based on their similarities and dissimilarities

  • The kernel parameter governs the separability of clusters in feature space, and its optimum value corresponds to a correct estimate of the number of clusters (CNC). We propose another important feature of the kernel K-MACE (K-Minimizing ACE) clustering algorithm: it automatically tunes the Gaussian kernel parameters to their optimum values (a generic illustration follows this list)

  • We compare K-MACE and kernel K-MACE with two divisive hierarchical clustering methods that are partitioning schemes: G-means [16], which is proposed mainly for Gaussian clusters, and Dip-means [17], a more recent approach
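
The paper's automatic tuning procedure is not reproduced here. As a hedged, generic illustration of why the Gaussian kernel width must be selected rather than fixed by hand, the sketch below grid-searches the RBF parameter `gamma` and scores each resulting partition with a silhouette index. SpectralClustering with an RBF affinity merely stands in for kernel K-means (scikit-learn ships no kernel K-means), so this is an assumption-laden sketch, not kernel K-MACE.

```python
# Generic grid search over the Gaussian (RBF) kernel width: different widths
# yield very different partitions in feature space, so the parameter must be
# validated rather than chosen by trial and error.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

best = None
for gamma in [0.01, 0.05, 0.1, 0.5, 1.0, 5.0]:          # candidate kernel widths
    labels = SpectralClustering(n_clusters=3, affinity="rbf",
                                gamma=gamma, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)                  # quality of the partition
    if best is None or score > best[0]:
        best = (score, gamma)

print("selected gamma:", best[1])
```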

Summary

INTRODUCTION

Clustering is one of the most widely used unsupervised learning tasks, in which unlabeled observed data samples are grouped based on their similarities and dissimilarities. While cluster assignment in K-means is based on the distance of a sample to its cluster center, another family of clustering algorithms is density based, where clusters are formed by grouping samples according to their proximity to their neighboring samples; these methods provide the CNC estimate simultaneously. The existing validity index methods used with K-means employ the available cluster compactness to estimate the CNC by optimizing a criterion that is not similar to the K-means partitioning criterion. Note that, in addition to the number of clusters m, existing kernel-based clustering methods require tuning the kernel function parameters. This is currently done by trial and error, and no method for validating and choosing the optimum parameters is available.
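
To make the contrast concrete, here is a minimal sketch of the conventional validity-index workflow described above: K-means is run for each candidate k, and the CNC estimate is the k that optimizes an external index (silhouette here) rather than the K-means objective itself. The choice of index and all settings are illustrative assumptions, not a specific method from the paper.

```python
# Conventional validity-index approach: scan candidate k, score each K-means
# partition with an external criterion, and report the best-scoring k as the CNC.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=1.0, random_state=1)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # external index, not the K-means SSE

cnc_estimate = max(scores, key=scores.get)
print("estimated CNC:", cnc_estimate)
```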

PROBLEM STATEMENT
M-CLUSTERING NOTATIONS
ACE IN M-CLUSTERING
MEAN AND VARIANCE OF ACE
ACE MEAN ESTIMATION
BOUNDS ON ACE
VALIDATION AND CONFIDENCE PROBABILITIES
K-MACE USING THE AVAILABLE DATA
INITIAL CLUSTER ASSIGNMENT IN KERNEL K-MACE
K-MACE IN FEATURE SPACE
OPTIMUM GAUSSIAN KERNEL PARAMETER
SIMULATIONS AND RESULTS
REAL DATA
CONCLUSION