Application of Algorithm CARDBK in Document Clustering

Yehang Zhu, Feng Shi, Mingjie Zhang

doi:10.1007/s11859-018-1357-3


Abstract


In the K-means clustering algorithm, each data point is assigned to exactly one cluster. Clustering quality depends heavily on the initial cluster centroids: different initializations can yield different results, and local adjustment cannot rescue the result from a poor local optimum. An outlier in a cluster can seriously distort the cluster mean, and K-means is only suitable for clusters with convex shapes. We therefore propose a novel clustering algorithm, CARDBK. “CARD” stands for “centroid all rank distance”, meaning that all centroids are ranked by their distance from a given point, and “BK” stands for “batch K-means”. In CARDBK, a point modifies not only the cluster centroid nearest to it but also multiple adjacent cluster centroids, and the degree of influence of a point on a cluster centroid depends on the distances between this point and the other, nearer cluster centroids. Experimental results showed that CARDBK outperformed other algorithms on a number of different data sets according to the following performance indexes: entropy, purity, F1 value, Rand index and normalized mutual information (NMI). The algorithm also proved to be more stable, linearly scalable and faster.
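The idea of letting one point update several ranked centroids can be illustrated with a minimal Python sketch. The abstract does not give the weighting formula, so the rule used here, (d_nearest / d_r) ** p restricted to the m nearest centroids, is a hypothetical stand-in and not the authors' method; the names card_weights, batch_update, cardbk_sketch and the parameters m and p are likewise assumptions made for illustration.

```python
import math

def card_weights(point, centroids, m=2, p=2.0):
    """Rank all centroids by distance from `point` and weight the m nearest.

    ASSUMPTION: the weight (d_nearest / d_r) ** p is a hypothetical choice;
    the abstract does not state the actual CARDBK weighting.
    """
    ranked = sorted((math.dist(point, c), k) for k, c in enumerate(centroids))
    d_nearest = ranked[0][0]
    weights = {}
    for d, k in ranked[:m]:
        # The nearest centroid always gets weight 1; farther ones get less.
        weights[k] = 1.0 if d == 0.0 else (d_nearest / d) ** p
    return weights

def batch_update(points, centroids, m=2, p=2.0):
    """One batch pass: each centroid becomes a weighted mean of the points."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    totals = [0.0] * len(centroids)
    for x in points:
        for k, w in card_weights(x, centroids, m, p).items():
            totals[k] += w
            for j in range(dim):
                sums[k][j] += w * x[j]
    # A centroid that received no weight keeps its previous position.
    return [
        [s / totals[k] for s in sums[k]] if totals[k] > 0 else list(centroids[k])
        for k in range(len(centroids))
    ]

def cardbk_sketch(points, centroids, iters=20, m=2, p=2.0):
    """Repeat the batch update a fixed number of times (no convergence test)."""
    for _ in range(iters):
        centroids = batch_update(points, centroids, m, p)
    return centroids
```

With m=1 every point affects only its nearest centroid, so the sketch reduces to the ordinary batch K-means update; larger m spreads each point's influence over adjacent centroids, which is the behaviour the abstract describes.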
