Abstract

Clustering is an automated search for hidden patterns in a data set that unveils groups of related observations. The technique is one of the viable means by which the patterns or internal structure of the data within a collection can be revealed. Choosing the right algorithm to obtain clusters of good quality is usually a challenge, especially when the number of clusters cannot be predetermined. This study evaluates selected clustering algorithms on their ability to find quality clusters in a data set. To that end, a prominent technique from each of the distance-based and distribution-based families of clustering algorithms, namely k-means and EM clustering respectively, is implemented. The data set on which the algorithms were run comprises 1,309 records of information on passengers who boarded a ship, retrieved from the RapidMiner open repository. Experiments were conducted and clusters were formed based on the chosen number of partitions, k. The quality of the clusters formed is measured using an external criterion, Normalized Mutual Information (NMI). The results show that the distance-based algorithm finds clusters of higher quality, with an NMI value of 0.912 out of a maximum achievable value of 1. The experiments further report the average execution time each algorithm takes to form the cluster model. The findings also offer useful insight into the choice of clustering algorithm with regard to its support for particular data types and its ease of execution.

Keywords: clustering, data mining, k-means, EM clustering, unsupervised learning.

Highlights

  • Clustering analysis is generally referred to as an unsupervised learning approach that seeks to identify or group objects based on their similarity features

  • The results of evaluating the quality of the clusters formed by the clustering algorithms implemented in this study are presented

  • In order to determine the qualities of the clusters formed, the Normalized Mutual Information (NMI) is computed
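
As an illustration of the NMI measure named above, the following is a minimal sketch in plain Python. It assumes the arithmetic-mean normalization, NMI = 2·I(U;V) / (H(U) + H(V)); the source does not state which normalization variant was used, so this choice, along with the function names `entropy`, `mutual_information`, and `nmi`, is an assumption made for illustration only.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(truth, pred):
    """Mutual information between two labelings of the same objects."""
    n = len(truth)
    joint = Counter(zip(truth, pred))   # co-occurrence counts
    count_t = Counter(truth)            # class sizes
    count_p = Counter(pred)             # cluster sizes
    mi = 0.0
    for (t, p), c in joint.items():
        mi += (c / n) * log(c * n / (count_t[t] * count_p[p]))
    return mi


def nmi(truth, pred):
    """NMI with arithmetic-mean normalization (an assumed variant)."""
    if len(truth) != len(pred):
        raise ValueError("labelings must have the same length")
    h_t, h_p = entropy(truth), entropy(pred)
    if h_t + h_p == 0.0:
        # Both labelings put every object in one group: treat as identical.
        return 1.0
    return 2.0 * mutual_information(truth, pred) / (h_t + h_p)
```

Because NMI compares groupings rather than label names, `nmi([0, 0, 1, 1], [1, 1, 0, 0])` reaches the maximum of 1 (up to floating-point rounding), while two statistically independent labelings score 0.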

Summary

Introduction

Clustering analysis is generally referred to as an unsupervised learning approach that seeks to identify or group objects based on their similarity features. Clustering techniques using the k-means (Suh, 2012) and k-medoids (Berkhin, 2006) algorithms are typical distance-based approaches. In distribution-based approaches, there is instead an attempt to reproduce the observed realization of data points as a mixture of predefined probability distribution functions (McLachlan et al., 2008). This descriptive technique is useful in several areas, especially for classification purposes. It is an unsupervised learning technique because it groups data objects without consulting class labels (Han et al., 2012); it automatically unveils the hidden features or patterns in the data set.
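
To make the distance-based approach concrete, here is a minimal sketch of the standard k-means iteration (Lloyd's algorithm) in plain Python. It is not the implementation used in the study. The function name `kmeans`, the `seed` and `max_iter` parameters, and the random choice of initial centroids are all assumptions made for illustration.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, max_iter=100, seed=0):
    """Group points (tuples of floats) into k clusters.

    Returns (labels, centroids). Illustrative sketch only.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed init: k distinct data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centroids
```

Note that no class labels enter the procedure: the grouping depends only on distances between observations, which is what makes the technique unsupervised. The number of partitions `k` must still be chosen in advance.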
