Abstract
<p>This thesis focuses on clustering for unsupervised learning. Our first topic of interest is estimating the correct number of clusters (CNC). In conventional clustering approaches, such as X-means, G-means, PG-means and Dip-means, estimating the CNC is a preprocessing step prior to finding the centers and clusters. In other words, the first step estimates the CNC and the second step finds the clusters, with each step minimizing a different objective function. Here, we propose minimum averaged central error (MACE)-means clustering, which uses a single objective function to simultaneously estimate the CNC and provide the cluster centers. We show the superiority of MACE-means over the conventional methods in terms of estimating the CNC, with comparable complexity. In addition, on average MACE-means yields better values of the adjusted Rand index (ARI) and variation of information (VI). Our next topic of interest is the order-selection step of the conventional methods, which is usually a statistical test such as the Kolmogorov-Smirnov test, the Anderson-Darling test, or Hartigan's Dip test. We propose a new statistical test, denoted Sigtest (signature testing). The conventional statistical tests rely on a particular assumption on the probability distribution of each cluster; Sigtest, on the other hand, can be used without any prior distribution assumption on the clusters. By replacing the statistical testing of the mentioned conventional approaches with Sigtest, we show that the clustering methods are improved in terms of a more accurate CNC as well as better ARI and VI. Conventional clustering approaches fail on arbitrarily shaped clusters. The last contribution of the thesis addresses arbitrary-shaped clustering: the proposed method, denoted minimum Pathways in Arbitrary Shape (minPAS) clustering, is based on a unique minimum spanning tree structure of the data.
Our simulation results show the advantage of minPAS over state-of-the-art arbitrary-shaped clustering methods such as DBSCAN and Affinity Propagation in terms of accuracy, ARI and VI.</p>
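The abstract does not spell out how minPAS uses the minimum spanning tree, so the following is only a generic sketch of the underlying MST-clustering idea, not the thesis's actual algorithm: build an MST over pairwise distances, then delete the k-1 heaviest edges so that each remaining connected component becomes a cluster. The function name `mst_cluster` and the toy data are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_cluster(X, k):
    """Generic MST-cut clustering sketch (not the thesis's minPAS):
    build a minimum spanning tree over pairwise distances, delete the
    k-1 heaviest edges, and return the connected components as clusters."""
    D = squareform(pdist(X))                 # dense pairwise-distance matrix
    mst = minimum_spanning_tree(D).toarray() # n-1 tree edges, row-major layout
    edges = np.argwhere(mst > 0)             # edge endpoints, same order as below
    weights = mst[mst > 0]                   # corresponding edge weights
    # Zero out the k-1 largest edges to split the tree into k components.
    for idx in np.argsort(weights)[len(weights) - (k - 1):]:
        i, j = edges[idx]
        mst[i, j] = 0
    _, labels = connected_components(mst, directed=False)
    return labels

# Two well-separated groups on a line; cutting the single heaviest
# MST edge (the 0.2 -- 5.0 gap) recovers them.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
labels = mst_cluster(X, 2)
print(labels)  # first three points share one label, last three another
```

Because only tree edges are cut, clusters can take arbitrary shapes as long as they are separated by relatively long edges, which is the property that makes MST structures attractive for this problem.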
Highlights
Clustering has a wide range of applications in different disciplines of science and engineering, such as bioinformatics, genetics, image segmentation [1], voice recognition, document classification and weather classification [2–4].
Note that the dependency of minimum averaged central error (MACE)-means on the assumption of equal variance across clusters is a disadvantage of the method, which should be addressed in future work.
The proposed Sigtest can also be used with other clustering methods. Another potential future direction is extending the MACE fundamentals to clustering methods with a wider range of assumptions beyond the spherical Gaussian.
Summary
The goal of a clustering algorithm is to group observed data samples based on their similarity and dissimilarity [14]. In this chapter, we briefly discuss some of the widely used clustering methods and their requirements. A more recently proposed method for statistical testing in clustering is Hartigan's Dip test, which generalizes the Gaussian assumption of the two above methods to a unimodal distribution. In general, spectral clustering methods use the K largest eigenvectors of the Laplacian of the affinity matrix to partition the data.
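The spectral-clustering step mentioned above can be sketched concretely. The version below follows the common Ng-Jordan-Weiss recipe (an assumption, since the summary does not name a specific variant): form a Gaussian affinity matrix, symmetrically normalize it, take its K largest eigenvectors, and run k-means on the rows of the resulting embedding. The function name and the bandwidth parameter `sigma` are illustrative.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform

def spectral_cluster(X, k, sigma=1.0):
    """Spectral clustering sketch (Ng-Jordan-Weiss style): k largest
    eigenvectors of the normalized affinity matrix, then k-means."""
    # Gaussian affinity matrix with zero diagonal.
    A = np.exp(-squareform(pdist(X, 'sqeuclidean')) / (2 * sigma ** 2))
    np.fill_diagonal(A, 0)
    d = A.sum(axis=1)
    # Symmetric normalization D^{-1/2} A D^{-1/2}.
    L = A / np.sqrt(np.outer(d, d))
    _, vecs = np.linalg.eigh(L)              # eigenvalues ascending
    V = vecs[:, -k:]                          # k largest eigenvectors
    V = V / np.linalg.norm(V, axis=1, keepdims=True)  # row-normalize
    _, labels = kmeans2(V, k, minit='++', seed=0)
    return labels

# Two well-separated Gaussian blobs: the spectral embedding makes
# them trivially separable for k-means.
X = np.vstack([np.random.RandomState(0).randn(20, 2),
               np.random.RandomState(1).randn(20, 2) + 8])
labels = spectral_cluster(X, 2)
print(labels)
```

The embedding step is what lets k-means, which on its own prefers spherical clusters, separate groups defined only by graph connectivity.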