Abstract

Cluster analysis is a method for classifying observations into several clusters. A common clustering strategy uses distance as a similarity index. However, a distance-based approach cannot be applied when the data are incomplete. To solve this problem, a genetic algorithm that involves variance (GACV) is applied. This study applied GACV to the Iris data introduced by Sir Ronald Fisher. Incomplete data were produced by deleting some values from the Iris data. The algorithm was developed in R 3.0.2 and gave satisfactory results for clustering the complete data, with 95.99% sensitivity and 98% consistency. GACV can cluster observations with missing values without filling in the missing values or excluding those observations. Performance on incomplete observations is also satisfactory but tends to decrease as the proportion of incomplete values increases. The proportion of incomplete values should be at most 40% to obtain sensitivity and consistency of at least 90%.
Keywords: Cluster Analysis, Genetic Algorithm, Incomplete Data.

Highlights

  • Cluster analysis is an important technique in a wide variety of fields, such as psychology, economics, biology, bioinformatics, medicine, business and marketing, social science, the World Wide Web, and data mining

  • Zadeh et al. (2011) applied cluster analysis to profile the customers of a bank based on their behavior

  • Because the species of each observation in the Iris data is given, the clustering result of GACV can be compared with the correct clusters to assess its performance

Introduction

Cluster analysis is an important technique in a wide variety of fields, such as psychology, economics, biology, bioinformatics, medicine, business and marketing, social science, the World Wide Web, and data mining. Zadeh et al. (2011) applied cluster analysis to profile the customers of a bank based on their behavior. Most clustering methods use distance as a similarity index for clustering the observations. This index requires complete information for all observations. Sometimes, however, observations have incomplete values for some variables. This disrupts the calculation of the distance to each observation, so the missing values must either be filled in or the affected observations excluded. Filling in the missing values introduces additional error into the analysis through the estimation of those values, whereas excluding observations reduces the available information. Moreover, sometimes the goal is to identify the group of an observation that itself has incomplete values, in which case exclusion cannot be applied. Using a different similarity index with other approaches can overcome this problem.
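
The contrast between a distance index and a variance-based index can be illustrated with a small sketch. This is not the GACV implementation from the study (which was written in R and whose exact fitness function is not given here). It assumes a simple criterion, the sum of per-cluster, per-variable variances computed over the observed values only, and it omits the genetic search entirely. The function names and the four toy observations are hypothetical and are not taken from the Iris data.

```python
import math

NAN = float("nan")  # marker for a missing value

def euclidean(a, b):
    """Plain Euclidean distance; the result is NaN if any coordinate is missing."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def within_cluster_variance(rows, labels):
    """Assumed criterion: sum over clusters and variables of the variance
    of the observed (non-missing) values. Lower means tighter clusters."""
    total = 0.0
    for k in set(labels):
        members = [r for r, lab in zip(rows, labels) if lab == k]
        for j in range(len(rows[0])):
            # Skip missing entries variable by variable, keeping the observation.
            observed = [r[j] for r in members if not math.isnan(r[j])]
            if len(observed) > 1:
                mean = sum(observed) / len(observed)
                total += sum((v - mean) ** 2 for v in observed) / len(observed)
    return total

# Hypothetical toy data: two variables, two observations with a missing value.
rows = [
    [1.0, 2.0],
    [1.2, NAN],
    [5.0, 6.0],
    [NAN, 6.2],
]
good = [0, 0, 1, 1]  # groups nearby observations together
bad = [0, 1, 0, 1]   # mixes distant observations

print(math.isnan(euclidean(rows[0], rows[1])))  # distance is undefined
print(within_cluster_variance(rows, good) < within_cluster_variance(rows, bad))
```

Under these assumptions, the distance between the first two observations is undefined, while the variance criterion still ranks the two candidate partitions without imputing or discarding any observation. A genetic algorithm could then search over label vectors using such a criterion as its fitness.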
