Abstract

Cluster analysis is a multivariate data mining technique that is widely used in several areas. It aims to automatically group the n elements of a database into k clusters, using only the information in the variables of each case. However, the accuracy of the final clusters depends on the clustering method used. In this paper, we evaluate the performance of the main methods for cluster analysis: Ward, K-means, and Self-Organizing Maps. Unlike many studies published in the area, we generated the datasets using the Design of Experiments (DOE) technique, in order to reach reliable conclusions about the methods by generalizing over the different possible data structures. We considered the number of variables, the number of clusters, dataset size, sample size, cluster overlapping, and the presence of outliers as the DOE factors. The datasets were analyzed by each clustering method, and the clustering partitions were compared by Attribute Agreement Analysis, providing valuable information about the individual effects of the considered factors and about their interactions. The results showed that the number of clusters, overlapping, and the interaction between sample size and number of variables significantly affect all the studied methods. Moreover, at a significance level of 5%, the methods have similar performance, and it is not possible to affirm that one outperforms the others.
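To make the experimental setup concrete, the six DOE factors can be enumerated as a design matrix. The sketch below assumes a two-level full factorial design with coded levels -1 (low) and +1 (high); the abstract does not state the actual levels or whether the design was full or fractional, so the function name `full_factorial`, the factor identifiers, and the levels are illustrative only.

```python
from itertools import product

# Factor names follow the abstract; the identifiers themselves are hypothetical.
FACTORS = [
    "n_variables",
    "n_clusters",
    "dataset_size",
    "sample_size",
    "overlap",
    "outliers",
]


def full_factorial(factors, levels=(-1, +1)):
    """Return one dict per run, covering every combination of factor levels.

    Assumption: every factor has the same coded levels (-1 = low, +1 = high).
    """
    return [
        dict(zip(factors, combo))
        for combo in product(levels, repeat=len(factors))
    ]


design = full_factorial(FACTORS)
print(len(design))  # 64 runs: six factors at two levels each
print(design[0])    # first run: every factor at its low level
```

Each row of `design` would correspond to one synthetic dataset configuration to be generated and then clustered by every method under comparison.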

Highlights

  • Cluster analysis, known as unsupervised classification, is a multivariate statistical data mining technique [1]–[3] based only on variable information. It aims to separate a set of objects into clusters such that the objects within each cluster are similar according to some distance function and, at the same time, dissimilar to the objects of other clusters

  • Ward and K-means were implemented using Minitab®, and the Self-Organizing Maps (SOM) network was implemented using Statistica®

  • Cluster analysis is widely used in several areas to solve important real problems, and the accuracy of the final solution depends on the clustering method used


Summary

Introduction

Cluster analysis, known as unsupervised classification, is a multivariate statistical data mining technique [1]–[3] based only on variable information. It aims to separate a set of objects into clusters such that the objects within each cluster are similar according to some distance function and, at the same time, dissimilar to the objects of other clusters. Methods of cluster analysis have been developed because of the need to analyze the large amount of data collected in various areas of knowledge [4], e.g.: marketing, identifying market share; medicine, identifying patients with a common disease cause; education, measuring psychological characteristics to identify groups of. There are papers about the use of clustering algorithms to identify characteristics of people who have attempted suicide [6]; to facilitate the diagnosis and treatment of cancer [7]; to identify residential and social patterns of homeless adults [8]; and, in the field of production engineering, clustering methods for production planning [9], [10] and for analyzing product portfolios [11].
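The idea of grouping objects by a distance function can be illustrated with a minimal K-means sketch. This is the textbook Lloyd iteration with Euclidean distance, written in plain Python; it is not the Minitab implementation used in the paper. The function name `kmeans`, the evenly spaced choice of initial centroids, and the toy data are all assumptions made for illustration.

```python
import math


def kmeans(points, k, n_iter=100):
    """Textbook Lloyd iteration with Euclidean distance.

    Assumption: initial centroids are k points taken at evenly spaced
    positions in the input, which keeps this sketch deterministic.
    """
    centroids = [tuple(points[i * len(points) // k]) for i in range(k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the partition is stable
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Toy data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(data, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

With overlapping clusters or outliers, two of the factors studied in the paper, the partition found this way can depend strongly on the initial centroids.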

