Abstract

Clustering is an important technique in data mining and knowledge discovery. Affinity propagation clustering (AP) and density peaks and distance-based clustering (DDC) are two significant clustering algorithms, proposed in 2007 and 2014 respectively. Both have simple, clear designs, are effective in finding meaningful clustering solutions, and have been applied successfully in a wide range of applications. However, a key disadvantage of AP is its high time complexity, which becomes a bottleneck when AP is applied to large-scale problems. The core idea of DDC is to construct a decision graph from the local density and the distance of each data point and then select the cluster centers from it; this selection is relatively subjective, and it is sometimes difficult to determine a suitable number of cluster centers. Here, we propose a two-stage clustering algorithm, called DDAP, to overcome these shortcomings. First, we select a small number of potential exemplars based on the two DDC quantities of each data point (local density and distance), which greatly reduces the size of the similarity matrix. Then we run message passing on the resulting incomplete similarity matrix. In experiments, two synthetic datasets, nine publicly available datasets, and a real-world electronic medical records (EMR) dataset are used to evaluate the proposed method. The results demonstrate that DDAP achieves clustering performance comparable to the original AP algorithm while noticeably improving computational efficiency.
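The first stage described above can be illustrated with a minimal sketch. It computes the two DDC quantities for each point (local density and distance to the nearest point of higher density) and keeps the highest-scoring points as potential exemplars. The cutoff radius `d_c`, the product score `rho * delta`, and the function names are assumptions made for illustration; the abstract does not state the exact selection rule DDAP uses.

```python
import numpy as np

def ddc_quantities(X, d_c):
    """Return (rho, delta) per row of X, using a cutoff kernel of radius d_c (assumed)."""
    # Pairwise Euclidean distances, shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # Local density: number of other points closer than d_c.
    rho = (D < d_c).sum(axis=1) - 1  # subtract 1 to exclude the point itself
    n = len(X)
    delta = np.empty(n)
    for i in range(n):
        higher = rho > rho[i]
        if higher.any():
            # Distance to the nearest point with strictly higher density.
            delta[i] = D[i, higher].min()
        else:
            # No denser point exists: use the largest distance from this point.
            delta[i] = D[i].max()
    return rho, delta

def candidate_exemplars(X, d_c, k):
    """Indices of the k points with the largest rho * delta (hypothetical selection rule)."""
    rho, delta = ddc_quantities(X, d_c)
    gamma = rho * delta
    return np.argsort(gamma)[::-1][:k]
```

With `k` much smaller than the number of points, a similarity matrix restricted to these candidate columns has `n * k` entries instead of `n * n`, which is the kind of compression the abstract refers to.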

Highlights

  • Clustering is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized [1]

  • We propose a two-stage fast affinity propagation (AP) clustering algorithm, DDAP, which largely improves the efficiency of AP while achieving clustering performance comparable to the original AP

  • The results demonstrate that DDAP achieves clustering performance comparable to the original AP algorithm while noticeably improving computational efficiency


Summary

Introduction

Clustering is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized [1]. Clustering serves two aims: (a) gaining an initial understanding of raw data and (b) reducing the size of a large volume of raw data [2]. Because of its importance, a large number of clustering algorithms have been proposed and applied widely in many domains [3], [4]. Affinity propagation clustering (AP) [5] and density peaks and distance-based clustering (DDC) [6] are two significant clustering algorithms, proposed in 2007 and 2014 respectively. Exemplar-based clustering finds representative data points, called exemplars, to serve as centers and assigns each remaining data point to its nearest center [7].
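The assignment step of exemplar-based clustering can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a set of exemplar indices that has already been chosen; the function name `assign_to_exemplars` is hypothetical and is not taken from the paper.

```python
import numpy as np

def assign_to_exemplars(X, exemplar_idx):
    """Label each row of X with the index of its nearest exemplar (Euclidean distance assumed)."""
    exemplar_idx = np.asarray(exemplar_idx)
    centers = X[exemplar_idx]
    # Distance from every point to every exemplar, shape (n, k).
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    # Map the position of the closest exemplar back to its row index in X.
    return exemplar_idx[d.argmin(axis=1)]
```

Each exemplar is at distance zero from itself, so it is labeled with its own index.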

