Abstract

K-means clustering is an important and popular technique in data mining. Unfortunately, for any given dataset (not knowledge-base), it is very difficult for a user to estimate the proper number of clusters in advance, and K-means tends to become trapped in a local optimum when the initial seeds are chosen randomly. Genetic algorithms (GAs) are commonly used to determine the number of clusters automatically and to find an optimal solution that serves either as the initial seeds of K-means clustering or as the clustering result itself. However, they typically choose the genes of chromosomes randomly, which leads to poor clustering results, whereas a carefully selected initial population can improve the final clustering results. Hence, some GA-based techniques carefully select a high-quality initial population, but at a high computational complexity. This paper proposes an adaptive GA (AGA) with an improved initial population for K-means clustering (SeedClust). In SeedClust, an improved density estimation method and an improved K-means++ are used to capture higher-quality initial seeds and to generate the initial population with low complexity. Adaptive crossover and mutation probabilities are designed to avoid premature convergence and to maintain population diversity, respectively, so that SeedClust can automatically determine the proper number of clusters and capture an improved initial solution. Finally, the best chromosomes (centers) are obtained and fed into K-means as initial seeds, which generates even higher-quality clustering results by allowing the initial seeds to readjust as needed. Experimental results on low-dimensional taxi GPS (Global Positioning System) data sets demonstrate that SeedClust achieves higher performance and effectiveness.
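The final step described in the abstract, in which carefully chosen seeds are handed to K-means and allowed to readjust, can be illustrated with a minimal sketch. It uses the standard K-means++ seeding rule and plain Lloyd iterations, not the paper's improved density estimation, improved K-means++, or adaptive GA, none of which are specified here. The function names `sq_dist`, `kmeanspp_seeds`, and `kmeans` are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeds(points, k, rng):
    """Standard K-means++ seeding (an assumption; not the paper's improved variant)."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Squared distance from each point to its nearest already-chosen seed.
        d2 = [min(sq_dist(p, s) for s in seeds) for p in points]
        if sum(d2) == 0:
            # All points coincide with existing seeds; fall back to a uniform pick.
            seeds.append(rng.choice(points))
        else:
            # Points far from every current seed are more likely to be chosen.
            seeds.append(rng.choices(points, weights=d2, k=1)[0])
    return seeds


def kmeans(points, seeds, max_iter=100):
    """Plain Lloyd iterations starting from the given seeds."""
    centers = [tuple(s) for s in seeds]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: the seeds have finished readjusting
        centers = new_centers
    return centers, labels
```

In a GA-based pipeline such as the one the abstract outlines, the list passed as `seeds` would come from the best chromosome rather than from `kmeanspp_seeds`; the K-means stage itself would be unchanged.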

Highlights

  • Data clustering is an important and well-known technique in the area of unsupervised machine learning

  • The experiments are implemented on four real-world taxi GPS data sets, which are often used for testing clustering results

  • For comparison, the Silhouette coefficient (SC), SSE, Davies-Bouldin index (DBI), and PBM are used to evaluate the clustering results on the taxi GPS data sets (Tables 2–5, respectively). The tables present the maximum, minimum, and average values of the clustering results for each evaluation criterion. To verify the overall performance of SeedClust, it is compared with GenClust, GAK, genetic algorithms (GAs)-clustering, and K-means in the experiment
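Two of the evaluation criteria named in the highlights, SSE and the Davies-Bouldin index, can be sketched with their common textbook definitions. This is an illustration under assumptions: the paper's exact formulations may differ, the helper names `centroid`, `sse`, and `davies_bouldin` are hypothetical, and the DBI function assumes at least two clusters with distinct centroids.

```python
import math


def centroid(members):
    """Coordinate-wise mean of a non-empty list of points (tuples)."""
    return tuple(sum(c) / len(members) for c in zip(*members))


def group_by_label(points, labels):
    """Map each cluster label to the list of points assigned to it."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters


def sse(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for members in group_by_label(points, labels).values():
        c = centroid(members)
        total += sum(math.dist(p, c) ** 2 for p in members)
    return total


def davies_bouldin(points, labels):
    """Davies-Bouldin index (lower is better), assuming >= 2 distinct centroids."""
    clusters = group_by_label(points, labels)
    cents = {lab: centroid(m) for lab, m in clusters.items()}
    # Scatter of a cluster: mean distance of its members to its centroid.
    scatter = {
        lab: sum(math.dist(p, cents[lab]) for p in m) / len(m)
        for lab, m in clusters.items()
    }
    worst_ratios = []
    for i in clusters:
        # For each cluster, take the most similar (worst-separated) other cluster.
        worst_ratios.append(
            max(
                (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                for j in clusters
                if j != i
            )
        )
    return sum(worst_ratios) / len(worst_ratios)
```

A lower SSE and a lower DBI both indicate tighter, better-separated clusters, whereas the Silhouette coefficient and PBM are read in the opposite direction (higher is better).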

Summary

Introduction

Data clustering is an important and well-known technique in the area of unsupervised machine learning. It is used for grouping similar records into one cluster and dissimilar records into different clusters [1,2,3,4,5]. K-means has a number of well-known drawbacks and usually obtains poor results when clusters have different sizes and shapes. These drawbacks mainly include requiring the user to provide the number of clusters K as an input [2,10] and being very sensitive to the quality of the initial seeds, so that poor-quality initial seeds produce poor-quality results [10,11]. The authors in [4]
