Abstract
This paper reports the results of a study of the partitioning around medoids (PAM) clustering algorithm applied to four datasets, both standardized and not, of varying sizes and numbers of clusters. The angular distance proximity measure was used to compute object-object similarity, in addition to the two more traditional proximity measures, the Euclidean distance and the Manhattan distance. The data comprise three widely available datasets and one constructed from publicly available climate data. The results replicate some well-known facts about the PAM algorithm: the quality of the generated clusters tends to be much better for small datasets; the silhouette value is a good, if imperfect, guide to the optimal number of clusters to generate; and human intervention is required to interpret the generated clusters. The results also indicate that the angular distance measure, which traditionally has not been widely used in clustering, outperforms both the Euclidean and Manhattan distance metrics in certain situations.
Keywords: PAM, Euclidean, Manhattan, Angular distance, Silhouette
Highlights
Cluster analysis is an unsupervised machine learning task used to find structure in unlabelled data
Interpretation of generated clusters often requires human intervention to explain patterns that are common to members of the clusters
Summary
Cluster analysis (or clustering) is an unsupervised machine learning task used to find structure in unlabelled data. The clustering task groups a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other clusters (Aldenderfer and Blashfield, 1984; Han et al., 2006). Several clustering approaches have been developed to address different types of data. These include partitioning approaches, hierarchical approaches, density-based methods, grid-based methods, model-based methods, special techniques for clustering high-dimensional data, and constraint-based clustering (Han et al., 2006; Yinghua et al., 2016).
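How similar two objects are depends on the proximity measure chosen. The following minimal sketch shows the three measures named in the abstract. It is an illustration only: the function names are hypothetical, and the angular distance is assumed here to be the angle in radians between two vectors (the arccosine of their cosine similarity), which may differ from the exact definition used in the study.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def angular(a, b):
    """Angle in radians between two vectors.

    Assumption: angular distance = arccos(cosine similarity).
    Both vectors are assumed to be nonzero.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_sim = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_sim = max(-1.0, min(1.0, cos_sim))
    return math.acos(cos_sim)


p = [3.0, 4.0]
q = [6.0, 8.0]  # same direction as p, twice the magnitude

print(euclidean(p, q))
print(manhattan(p, q))
print(angular(p, q))
```

Under these assumptions, `p` and `q` are separated by nonzero Euclidean and Manhattan distances, while their angular distance is zero because they point in the same direction. This sensitivity to direction rather than magnitude is one way an angular measure can group objects differently from the other two.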