Genetic Algorithm with an Improved Initial Population Technique for Automatic Clustering of Low-Dimensional Data

Xiangbing Zhou, Fang Miao, Hongjiang Ma

doi:10.3390/info9040101

Abstract

K-means clustering is an important and popular technique in data mining. Unfortunately, for a given dataset without prior knowledge, it is very difficult for a user to estimate the proper number of clusters in advance, and K-means tends to become trapped in a local optimum when the initial seeds are chosen randomly. Genetic algorithms (GAs) are commonly used to determine the number of clusters automatically and to obtain an optimal solution that serves either as the initial seeds for K-means or as the clustering result itself. However, they typically choose the genes of the chromosomes randomly, which leads to poor clustering results, whereas a carefully selected initial population can improve the final clustering results. Some GA-based techniques therefore select a high-quality initial population carefully, but at the cost of high complexity. This paper proposes an adaptive GA (AGA) with an improved initial population for K-means clustering (SeedClust). In SeedClust, an improved density estimation method and an improved K-means++ are presented to capture higher-quality initial seeds and to generate the initial population with low complexity. Adaptive crossover and mutation probabilities are designed to counter premature convergence and to maintain population diversity, respectively. Together, these components automatically determine the proper number of clusters and capture an improved initial solution. Finally, the best chromosomes (centers) are obtained and fed into K-means as initial seeds, which generates even higher-quality clustering results by allowing the initial seeds to readjust as needed. Experimental results on low-dimensional taxi GPS (Global Positioning System) data sets demonstrate that SeedClust achieves higher performance and effectiveness.
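The last step described in the abstract, feeding selected centers into K-means so that they can readjust, can be illustrated with a minimal sketch. It assumes the standard K-means++ seeding rule and plain Lloyd iterations; it does not reproduce the paper's improved density estimation, its improved K-means++, or the adaptive GA. The function names (`sq_dist`, `kmeans_pp_seeds`, `kmeans`) and the toy points are hypothetical and chosen only for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seeds(points, k, rng):
    """Standard K-means++ seeding (not the paper's improved variant).

    Each new seed is drawn with probability proportional to its squared
    distance from the nearest seed chosen so far.
    """
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        d2 = [min(sq_dist(p, s) for s in seeds) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a seed; fall back to a uniform draw.
            seeds.append(rng.choice(points))
        else:
            seeds.append(rng.choices(points, weights=d2, k=1)[0])
    return seeds


def kmeans(points, seeds, max_iter=100):
    """Plain Lloyd iterations started from the given seeds."""
    centers = [tuple(s) for s in seeds]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest center for every point.
        labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Toy 2-D points standing in for low-dimensional GPS coordinates.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
seeds = kmeans_pp_seeds(points, 2, random.Random(42))
centers, labels = kmeans(points, seeds)
print(centers, labels)
```

In this sketch the number of clusters `k` is still supplied by hand; in SeedClust, according to the abstract, the GA determines it automatically before the final K-means run.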

Highlights

Data clustering is an important and well-known technique in the area of unsupervised machine learning
The experiments are implemented on four real-world taxi GPS data sets, which are often used for testing clustering results
For the purpose of comparison, the Silhouette coefficient (SC), the sum of squared errors (SSE), the Davies-Bouldin index (DBI), and PBM are used to evaluate the clustering results on the taxi GPS data sets (Tables 2–5, respectively). The tables present the maximum, minimum, and average values of the clustering results for each evaluation criterion. To verify the overall performance of SeedClust, it is compared with GenClust, GAK, GA-clustering, and K-means in the experiment
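Two of the evaluation criteria named above, SSE and DBI, are simple enough to sketch directly. The code below uses their standard textbook definitions, which may differ in detail from the exact formulation used in the paper; SC and PBM are omitted. The function names and the toy data are hypothetical, and the sketch assumes at least two clusters with distinct centers.

```python
import math


def sse(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its center."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centers[lab]))
        for p, lab in zip(points, labels)
    )


def davies_bouldin(points, labels, centers):
    """Davies-Bouldin index (lower is better).

    For each cluster, take the worst ratio of summed within-cluster
    scatter to between-center distance, then average over clusters.
    """
    k = len(centers)
    if k < 2:
        raise ValueError("DBI needs at least two clusters")
    # Scatter of a cluster: mean distance of its members to the center.
    scatter = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            total = sum(math.dist(p, centers[j]) for p in members)
            scatter.append(total / len(members))
        else:
            scatter.append(0.0)
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k) if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k


# Toy clustering: two compact, well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
centers = [(0.5, 0.5), (10.5, 10.5)]
print(sse(points, labels, centers), davies_bouldin(points, labels, centers))
```

A lower SSE indicates more compact clusters, and a lower DBI indicates clusters that are both compact and well separated, so the two criteria are read in the same direction.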

Summary

Introduction

Data clustering is an important and well-known technique in the area of unsupervised machine learning. It is used for grouping similar records into one cluster and dissimilar records into different clusters [1,2,3,4,5]. K-means has a number of well-known drawbacks and usually obtains poor results when clusters have different sizes and shapes. These drawbacks mainly include requiring the user to provide the number of clusters K as an input [2,10] and a strong sensitivity to the quality of the initial seeds, so that poor seeds produce poor-quality results [10,11]. The authors in [4]

Journal: Information
Publication Date: Apr 21, 2018
Citations: 10
License type: CC BY 4.0

Similar Papers

An Automatic K-Means Clustering Algorithm of GPS Data Combining a Novel Niche Genetic Algorithm with Noise and Density
Xiangbing Zhou ... Jianggang Gu
ISPRS International Journal of Geo-Information | VOL. 6
01 Dec 2017

An adaptive genetic algorithm with diversity-guided mutation and its global convergence property
Mei-Yi Li ... Guo-Yun Sun
Journal of Central South University of Technology | VOL. 11
01 Sep 2004

Adaptive probabilities of crossover and mutation in genetic algorithms
M Srinivas ... L.M Patnaik
IEEE Transactions on Systems, Man, and Cybernetics | VOL. 24
01 Apr 1994

A novel clustering algorithm combining niche genetic algorithm with canopy and K-means
Hua Zhang ... Xiangbing Zhou
01 May 2018
