Adaptive density peak clustering based on dimensional-free and reverse k-nearest neighbors

Qiannan Wu, Huiyu Mu, Ruizhi Sun, Feiyu Shang, Qianqian Zhang, Li Li

doi:10.5755/j01.itc.49.3.23405

Open Access

https://doi.org/10.5755/j01.itc.49.3.23405

Abstract

Cluster analysis plays a crucial role in consumer behavior segmentation. The density peak clustering algorithm (DPC) is a novel density-based clustering method. However, it performs poorly on high-dimensional datasets and in estimating the local density of boundary points. In addition, its fault tolerance is limited by its one-step allocation strategy. To overcome these disadvantages, this paper proposes an adaptive density peak clustering algorithm based on dimensional-free and reverse k-nearest neighbors (ERK-DPC). First, we compute the Euler cosine distance to obtain the similarity of sample points in high-dimensional datasets. Then, an adaptive local density formula is used to measure the local density of each point. Finally, the reverse k-nearest neighbor idea is added to a two-step allocation strategy, which assigns the remaining points accurately and effectively. The proposed clustering algorithm is evaluated on several benchmark datasets and real-world datasets. In these comparisons, the results demonstrate that the ERK-DPC algorithm is superior to some state-of-the-art methods.
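As a rough illustration of the building blocks named in the abstract, the sketch below computes the two quantities used by baseline DPC (local density rho with a cutoff kernel, and delta, the distance to the nearest point of higher density) together with reverse k-nearest neighbor sets. It is not the ERK-DPC method itself: the adaptive density formula and the two-step allocation are not specified in this section, and the function names, the cutoff dc, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def pairwise_euclidean(X):
    """Full Euclidean distance matrix for the rows of X (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def dpc_rho_delta(D, dc):
    """Baseline DPC quantities from a distance matrix D, assuming dc > 0.

    rho[i]   = number of other points closer than the cutoff dc
    delta[i] = distance to the nearest point with strictly higher rho;
               for a point with no denser point, the largest distance in its row.
    """
    n = D.shape[0]
    rho = (D < dc).sum(axis=1) - 1  # subtract 1 to exclude the point itself
    delta = np.zeros(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = D[i, higher].min() if higher.size else D[i].max()
    return rho, delta

def reverse_knn(D, k):
    """For each point p, the set of points q that list p among their k nearest neighbors."""
    n = D.shape[0]
    masked = D.astype(float)          # astype returns a copy, so D is not modified
    np.fill_diagonal(masked, np.inf)  # a point is never its own neighbor
    knn = np.argsort(masked, axis=1)[:, :k]
    rknn = [set() for _ in range(n)]
    for q in range(n):
        for p in knn[q]:
            rknn[int(p)].add(q)
    return rknn

# Toy 1-D data (hypothetical): three nearby points and one distant point.
X = np.array([[0.0], [1.0], [3.0], [10.0]])
D = pairwise_euclidean(X)
rho, delta = dpc_rho_delta(D, dc=2.5)
rknn = reverse_knn(D, k=1)
```

In this toy example the isolated point at 10.0 has an empty reverse neighbor set, which suggests why reverse neighbor information may be useful when assigning boundary or outlying points.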

Highlights

With the development of information technology, an increasing amount of consumption and production data has emerged.
Because high-dimensional data are strongly affected by noise and sparsity, the traditional Euclidean distance cannot measure the distance between sample points correctly, so this paper introduces a novel Euler cosine distance formula.
The Euler cosine distance formula can measure distance more accurately without reducing the data dimension. It can avoid the effects of noise and sparsity in high-dimensional data and can effectively represent the true distance between sample points.
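The Euler cosine distance formula itself is not reproduced in this section, so the sketch below uses plain cosine distance (1 minus cosine similarity) as a stand-in to show how an angle-based distance matrix can be computed without reducing the data dimension. The function name and the eps guard against zero-length vectors are assumptions for illustration, not part of the paper.

```python
import numpy as np

def cosine_distance_matrix(X, eps=1e-12):
    """Pairwise cosine distance (1 - cosine similarity) between the rows of X.

    Stand-in only: this is ordinary cosine distance, not the paper's
    Euler cosine distance.
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, eps)       # unit-length rows; eps avoids division by zero
    sim = np.clip(Xn @ Xn.T, -1.0, 1.0)   # clip small floating-point overshoot
    return 1.0 - sim

# Two vectors pointing the same way but with different lengths, plus one orthogonal vector.
X = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
D = cosine_distance_matrix(X)
```

Here the first two rows are at distance 0 despite their different magnitudes, while the orthogonal row is at distance 1, which is the kind of scale insensitivity that motivates angle-based measures over Euclidean distance.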

Summary

Introduction

With the development of information technology, an increasing amount of consumption and production data has emerged. How to find rules and consumption patterns in these large amounts of data is a concern in various fields. Clustering is a research hotspot in the field of data mining, and it is a typical unsupervised learning method [14]. Clustering methods can find dense and sparse areas of data without any prior knowledge, and can reveal the global distribution of the data and the relationships between data attributes. Clustering has been widely applied in many fields, including pattern recognition [17], market analysis [25], image processing [8], time series analysis [19], information retrieval [35] and social networking [5]. Clustering methods fall into several broad categories: hierarchical-based, partitioning-based, density-based, model-based and grid-based approaches [18].

Methods

Results

Conclusion