Abstract

Density Peaks Clustering (DPC) is a density-based clustering algorithm whose advantages include requiring no preset clustering parameters and detecting non-spherical clusters. However, the original algorithm obtains cluster centers by manually specifying the cutoff distance and manually selecting centers from the decision graph, so the centers are not chosen with the whole data set taken into account. This paper proposes a method, G-KNN-DPC, that calculates the cutoff distance based on the Gini coefficient and selects centers with K-nearest neighbor (KNN): it first finds the optimal cutoff distance using the Gini coefficient and then identifies the center points using KNN. Automatic center selection not only avoids the error of detecting two center points in one cluster but also addresses the traditional DPC algorithm's inability to handle complex data sets. Compared with DPC, Fuzzy C-Means, K-means, KDPC and DBSCAN, the proposed algorithm produces better clusters on a range of data sets.
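To make the DPC workflow the abstract refers to concrete, the sketch below computes the two standard DPC decision quantities, local density rho and delta (the distance to the nearest higher-density point), and picks centers as the points with the largest rho * delta. This is a minimal illustration of plain DPC with a Gaussian-kernel density and a fixed cutoff distance; the paper's Gini-based cutoff selection and KNN center rule are not reproduced here.

```python
import numpy as np

def dpc_quantities(X, dc):
    """Compute the two DPC decision quantities for every point:
    rho   - local density (Gaussian kernel with cutoff distance dc)
    delta - distance to the nearest point of higher density."""
    n = len(X)
    # Pairwise Euclidean distances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Gaussian-kernel local density; subtract 1 to exclude the point itself
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0
    delta = np.zeros(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if higher.size:
            delta[i] = d[i, higher].min()
        else:
            # Convention: the highest-density point gets the maximum distance
            delta[i] = d[i].max()
    return rho, delta

# Two well-separated blobs; the two density peaks should score highest
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
rho, delta = dpc_quantities(X, dc=0.5)
centers = np.argsort(rho * delta)[-2:]   # top-2 rho * delta -> one per blob
```

In a manual DPC run the user inspects the (rho, delta) decision graph and picks the outlying points by eye; automating that choice is exactly the gap G-KNN-DPC targets.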

Highlights

  • Big data has been rapidly and widely used in fields such as physics, biological engineering, and life medicine [1]

  • Based on many improvements to the density peaks clustering algorithm [18]–[26] and outlier detection strategies [27], [28], we propose a method to calculate the cutoff distance based on the Gini coefficient and to find center points by K-nearest neighbor (KNN)

  • The results demonstrate that G-KNN-DPC accounts for the true distribution of a data set and achieves better performance
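The second highlight says the cutoff distance is chosen using the Gini coefficient. The exact objective G-KNN-DPC optimizes is not spelled out in this summary, but the discrete Gini coefficient itself, a measure of how unequally a set of non-negative values (here, point densities) is distributed, can be sketched as:

```python
import numpy as np

def gini(values):
    """Discrete Gini coefficient of a non-negative sample.
    Returns 0 when all values are equal; values near 1 mean a few
    elements dominate the total."""
    v = np.sort(np.asarray(values, dtype=float))
    n = v.size
    total = v.sum()
    # G = 2 * sum_i(i * v_i) / (n * sum(v)) - (n + 1) / n, with i = 1..n
    return 2.0 * np.sum(np.arange(1, n + 1) * v) / (n * total) - (n + 1) / n

# Equal densities -> no inequality; one dominant value -> high inequality
g_equal = gini([1, 1, 1, 1])
g_skewed = gini([0, 0, 0, 10])
```

A Gini-based cutoff search would presumably evaluate candidate cutoff distances, compute the densities each induces, and keep the candidate whose density distribution best satisfies the chosen Gini criterion; that search loop is an assumption here, since the paper's exact rule is not given in this summary.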


Summary

INTRODUCTION

Big data has been rapidly and widely used in fields such as physics, biological engineering, and life medicine [1]. K-nearest neighbor (KNN) is a simple and efficient classification algorithm that can handle text and stream data classification problems [14]; because it also performs well in clustering, the KNN idea has repeatedly been introduced into the DPC algorithm. One such algorithm uses KNN to estimate the density of each point and principal component analysis to reduce the dimensionality of the data, improving the handling of high-dimensional data and achieving a good clustering effect [16]. Building on the many improvements to the density peaks clustering algorithm [18]–[26] and on outlier detection strategies [27], [28], we propose a method that calculates the cutoff distance based on the Gini coefficient and finds center points by KNN. The conclusions and expected future work are given in the last section.
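The KNN-based density estimation mentioned above can be sketched as follows. The paper's exact formula is not given in this summary; the version below uses a common exponential KNN estimate, where a point whose k nearest neighbors are close receives a high density score.

```python
import numpy as np

def knn_density(X, k=5):
    """KNN-style local density: exp(-mean distance to the k nearest
    neighbours). This is one common estimate used by KNN variants of
    DPC; the formula in G-KNN-DPC may differ."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Sort each row; column 0 is the point itself (distance 0), so skip it
    knn_d = np.sort(d, axis=1)[:, 1:k + 1]
    return np.exp(-knn_d.mean(axis=1))

# A tight cluster should score much higher than scattered noise
rng = np.random.default_rng(1)
dense = rng.normal(0, 0.05, (30, 2))    # tight cluster near the origin
sparse = rng.uniform(-3, 3, (10, 2))    # scattered background points
X = np.vstack([dense, sparse])
rho = knn_density(X, k=5)
```

Unlike a cutoff-kernel density, this estimate needs no cutoff distance of its own, which is one reason KNN ideas pair naturally with automatic center selection in DPC variants.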

RELATED WORK
CALCULATE THE CUTOFF DISTANCE BASED ON GINI COEFFICIENT
FIND THE CENTER POINTS BY USING K-NEAREST NEIGHBOR
EXPERIMENTS AND ANALYSIS
DECISION GRAPHS COMPARATIVE ANALYSIS
Findings
CONCLUSION