Computational Complexity between K-Means and K-Medoids Clustering Algorithms for Normal and Uniform Distributions of Data Points

Velmurugan Velmurugan

doi:10.3844/jcssp.2010.363.368

Abstract

Problem statement: Clustering is one of the most important research areas in the field of data mining. Clustering means creating groups of objects based on their features in such a way that objects belonging to the same group are similar and those belonging to different groups are dissimilar. Clustering is an unsupervised learning technique. Its main advantage is that interesting patterns and structures can be found directly from very large data sets with little or no background knowledge. Clustering algorithms can be applied in many domains. Approach: In this research, the most representative algorithms, K-Means and K-Medoids, were examined and analyzed based on their basic approach. The best algorithm in each category was identified based on its performance. The input data points were generated in two ways, one using a normal distribution and the other using a uniform distribution. Results: The randomly distributed data points were taken as input to these algorithms, and clusters were identified for each algorithm. The algorithms were implemented in Java, and their performance was analyzed based on clustering quality. The execution time of the algorithms in each category was compared across different runs. The accuracy of each algorithm was investigated over different executions of the program on the input data points. Conclusion: The average time taken by the K-Means algorithm is greater than the time taken by the K-Medoids algorithm for both the normal and the uniform distribution. The results proved to be satisfactory.

Highlights

Clustering is one of the most important research areas in the field of data mining
Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data (Jain and Dubes, 1988; Jain et al., 1999)
The main advantage of clustering is that interesting patterns and structures can be found directly from very large data sets with little or no background knowledge

Summary

Introduction

Clustering is one of the most important research areas in the field of data mining. Clustering means creating groups of objects based on their features in such a way that objects belonging to the same group are similar and those belonging to different groups are dissimilar. Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data (Jain and Dubes, 1988; Jain et al., 1999). Researchers have proposed a number of clustering algorithms; this study presents a comparative analysis of two of them, K-Means and K-Medoids (Berkhin, 2002; Dunham, 2002; Han and Kamber, 2006; Xiong et al., 2009; Park et al., 2006; Khan and Ahmad, 2004; Borah and Ghose, 2009; Rakhlin and Caponnetto, 2007). Randomly distributed data points were taken as input to the K-Means and K-Medoids algorithms, and clusters were identified for each algorithm. The algorithms were implemented in Java, and their performance was analyzed based on clustering quality. The accuracy of each algorithm was investigated over different executions of the program on the input data points.
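
The experimental setup described above (points drawn from a normal and from a uniform distribution, clustered by both algorithms, with execution time recorded) can be illustrated with a minimal Java sketch. It is not the authors' implementation. The class name `ClusteringSketch`, all method names, the two-dimensional points, the parameter values and the choice of the first k points as initial centers are assumptions made for illustration. The K-Medoids part uses a simple alternating medoid update, which may differ from the variant used in the study.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch only; every name and parameter here is hypothetical.
class ClusteringSketch {

    // n two-dimensional points, each coordinate drawn from N(mean, stdDev^2).
    public static double[][] normalPoints(int n, double mean, double stdDev, long seed) {
        Random rng = new Random(seed);
        double[][] pts = new double[n][2];
        for (double[] p : pts) {
            p[0] = mean + stdDev * rng.nextGaussian();
            p[1] = mean + stdDev * rng.nextGaussian();
        }
        return pts;
    }

    // n two-dimensional points, each coordinate uniform on [low, high).
    public static double[][] uniformPoints(int n, double low, double high, long seed) {
        Random rng = new Random(seed);
        double[][] pts = new double[n][2];
        for (double[] p : pts) {
            p[0] = low + (high - low) * rng.nextDouble();
            p[1] = low + (high - low) * rng.nextDouble();
        }
        return pts;
    }

    static double squaredDistance(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    // Assignment step shared by both algorithms: nearest center for each point.
    public static int[] assign(double[][] pts, double[][] centers) {
        int[] labels = new int[pts.length];
        for (int i = 0; i < pts.length; i++) {
            int best = 0;
            for (int c = 1; c < centers.length; c++) {
                if (squaredDistance(pts[i], centers[c]) < squaredDistance(pts[i], centers[best])) {
                    best = c;
                }
            }
            labels[i] = best;
        }
        return labels;
    }

    // K-Means update: each center becomes the mean of its cluster.
    public static double[][] meanUpdate(double[][] pts, int[] labels, double[][] centers) {
        int k = centers.length;
        double[][] sums = new double[k][2];
        int[] counts = new int[k];
        for (int i = 0; i < pts.length; i++) {
            sums[labels[i]][0] += pts[i][0];
            sums[labels[i]][1] += pts[i][1];
            counts[labels[i]]++;
        }
        double[][] next = new double[k][];
        for (int c = 0; c < k; c++) {
            // An empty cluster keeps its previous center.
            next[c] = counts[c] == 0 ? centers[c].clone()
                    : new double[] { sums[c][0] / counts[c], sums[c][1] / counts[c] };
        }
        return next;
    }

    // K-Medoids update (alternating variant): each center becomes the cluster
    // member with the smallest total distance to the other members.
    public static double[][] medoidUpdate(double[][] pts, int[] labels, double[][] centers) {
        int k = centers.length;
        double[][] next = new double[k][];
        double[] bestCost = new double[k];
        Arrays.fill(bestCost, Double.POSITIVE_INFINITY);
        for (int c = 0; c < k; c++) next[c] = centers[c].clone();
        for (int i = 0; i < pts.length; i++) {
            double cost = 0.0;
            for (int j = 0; j < pts.length; j++) {
                if (labels[j] == labels[i]) cost += Math.sqrt(squaredDistance(pts[i], pts[j]));
            }
            if (cost < bestCost[labels[i]]) {
                bestCost[labels[i]] = cost;
                next[labels[i]] = pts[i].clone();
            }
        }
        return next;
    }

    public static void main(String[] args) {
        int n = 1000, k = 3, iterations = 10; // illustrative values, not the paper's settings
        double[][][] inputs = { normalPoints(n, 50.0, 15.0, 42L), uniformPoints(n, 0.0, 100.0, 42L) };
        String[] names = { "normal", "uniform" };
        for (int d = 0; d < inputs.length; d++) {
            double[][] pts = inputs[d];
            for (boolean useMedoids : new boolean[] { false, true }) {
                double[][] centers = Arrays.copyOf(pts, k); // first k points as initial centers
                long start = System.nanoTime();
                for (int it = 0; it < iterations; it++) {
                    int[] labels = assign(pts, centers);
                    centers = useMedoids ? medoidUpdate(pts, labels, centers)
                                         : meanUpdate(pts, labels, centers);
                }
                long micros = (System.nanoTime() - start) / 1_000;
                System.out.println(names[d] + " / " + (useMedoids ? "K-Medoids" : "K-Means")
                        + ": " + micros + " us");
            }
        }
    }
}
```

The two algorithms differ only in the update step: the K-Means center is an arithmetic mean and need not coincide with any data point, whereas the medoid is always one of the input points. Timings printed by this sketch depend on JVM warm-up, the fixed iteration count and the particular medoid update chosen, so they should not be read as reproducing the paper's measurements.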

Methods

Results

Discussion

Conclusion