Abstract

The K-means algorithm has been extensively investigated in the field of text clustering because of its linear time complexity and its adaptation to sparse matrix data. However, it has two main problems: the determination of the number of clusters and the location of the initial cluster centres. In this study, we propose an improved K-means++ algorithm based on the Davies–Bouldin index (DBI) and the largest sum of distances, called the SDK-means++ algorithm. Firstly, we use term frequency-inverse document frequency to represent the data set. Secondly, we measure the distance between objects by cosine similarity. Thirdly, the initial cluster centres are selected by comparing the distance to the existing initial cluster centres and the maximum density. Fourthly, clustering results are obtained using the K-means++ method. Lastly, DBI is used to obtain the optimal clustering result automatically. Experimental results on real bank transaction volume data sets show that the SDK-means++ algorithm is more effective and efficient than two other algorithms in organising large financial text data sets. The F-measure value of the proposed algorithm is 0.97, and its running time is reduced by 42.9% and 22.4% compared with that of the K-means and K-means++ algorithms, respectively.
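The abstract's pipeline (TF-IDF vectorisation, K-means++ clustering, DBI-based selection of the cluster count) can be sketched with off-the-shelf tools. This is an illustrative simplification, not the authors' code: scikit-learn's `KMeans` uses Euclidean distance rather than cosine similarity and its stock k-means++ seeding rather than the paper's distance-and-density rule, and the sample corpus is invented.

```python
# Hedged sketch of the abstract's pipeline using scikit-learn
# (assumptions: Euclidean K-means++, invented mini-corpus).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

corpus = [
    "wire transfer to savings account",
    "wire transfer fee refund",
    "mortgage payment received",
    "mortgage interest statement",
    "atm cash withdrawal",
    "atm withdrawal fee",
]

# Step 1: represent the data set with TF-IDF
X = TfidfVectorizer().fit_transform(corpus)

# Steps 4-5: run K-means++ for several cluster counts and keep the
# one with the lowest DBI (lower DBI = compact, well-separated clusters)
best_k, best_dbi = None, float("inf")
for k in range(2, 5):
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    random_state=0).fit_predict(X)
    dbi = davies_bouldin_score(X.toarray(), labels)
    if dbi < best_dbi:
        best_k, best_dbi = k, dbi

print(best_k, round(best_dbi, 3))
```

Scanning a range of k values and minimising the DBI is what lets the method choose the number of clusters automatically instead of requiring it as an input.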

Highlights

  • Clustering is the process of dividing a data set into clusters so that the objects in the same cluster are similar to each other and the objects in different clusters are dissimilar

  • Partition-based algorithms are widely used in various fields because of their easy implementation [4]. The most typical partitional method is K-means [2]. The K-means algorithm can adapt to sparse matrix data sets, and it is efficient in organising large data sets

  • The proposed algorithm generates a feature vector space based on term frequency-inverse document frequency (TF-IDF) and uses cosine similarity to calculate the vector distance. Then, the algorithm selects the initial cluster centres based on the largest sum of distances to all existing initial cluster centres
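The centre-selection idea in the highlights can be sketched as a greedy loop: each new initial centre is the point with the largest summed cosine distance to all centres chosen so far. A minimal NumPy sketch follows; the function names and the choice of the first centre are illustrative assumptions (the paper additionally weighs density, which is omitted here).

```python
# Minimal sketch of largest-sum-of-distances centre initialisation.
# Assumptions: X is a dense matrix of row vectors; the first centre
# index is given (the paper's density criterion is not reproduced).
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two row vectors."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def select_initial_centres(X, k, first=0):
    """Greedily pick k initial centres: after the first, each new centre
    maximises the total cosine distance to all centres chosen so far."""
    centres = [first]
    while len(centres) < k:
        sums = [sum(cosine_distance(X[i], X[c]) for c in centres)
                for i in range(len(X))]
        for c in centres:          # never re-pick an existing centre
            sums[c] = -1.0
        centres.append(int(np.argmax(sums)))
    return centres
```

For example, with two near-duplicate pairs of vectors, the second centre lands in the opposite pair, which is exactly the spread the highlight describes.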


Summary

Introduction

Clustering is the process of dividing a data set into clusters (subsets) so that the objects in the same cluster are similar to each other and the objects in different clusters are dissimilar. Huan et al. [5] proposed using KL divergence to calculate the similarity between cluster centres and text data objects, thereby making the K-means algorithm more efficient and effective. To reduce the number of iterations and avoid falling into a local optimum, many scholars have proposed optimising the selection of the initial cluster centres directly. A newly defined decision graph can help the DP algorithm avoid noise interference when selecting the initial cluster centres. Other improved methods, such as the semisupervised clustering algorithm based on pairwise constraints, can also enhance clustering performance. We use the Davies–Bouldin index (DBI) to evaluate the clustering results and obtain the optimal number of clusters; the method is efficient and improves clustering accuracy when organising large data sets.
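The DBI mentioned above averages, over all clusters, the worst-case ratio of within-cluster scatter to between-centroid separation, so smaller values indicate compact, well-separated clusters. A plain NumPy implementation of this standard definition (with Euclidean distances, as in the usual formulation; not the paper's code) looks like:

```python
# Hedged sketch of the Davies–Bouldin index: for each cluster i,
# take max over j != i of (S_i + S_j) / M_ij, where S is the mean
# point-to-centroid distance and M_ij the distance between centroids,
# then average over clusters. Standard definition, not the paper's code.
import numpy as np

def davies_bouldin(X, labels):
    ks = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
    # S_i: mean Euclidean distance of cluster i's points to its centroid
    scatter = np.array([
        np.mean(np.linalg.norm(X[labels == k] - centroids[i], axis=1))
        for i, k in enumerate(ks)
    ])
    total = 0.0
    for i in range(len(ks)):
        ratios = [(scatter[i] + scatter[j])
                  / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(len(ks)) if j != i]
        total += max(ratios)
    return total / len(ks)
```

Running this for each candidate number of clusters and keeping the minimum is what lets the number of clusters be chosen automatically rather than fixed in advance.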

Related Work
Classic Clustering Algorithms Based on Partition
Validation methods
Findings
Conclusion