A Clustering-Oriented Closeness Measure Based on Neighborhood Chain and Its Application in the Clustering Ensemble Framework Based on the Fusion of Different Closeness Measures.

Shaoyi Liang, Deqiang Han

doi:10.3390/s17102226

Abstract

Closeness measures are crucial to clustering methods. In most traditional clustering methods, the closeness between data points or clusters is measured by geometric distance alone. Such metrics quantify closeness based only on the positions of the data points concerned in the feature space, and they can cause problems in clustering tasks with arbitrary cluster shapes and different cluster densities. In this paper, we first propose a novel Closeness Measure between data points based on the Neighborhood Chain (CMNC). Instead of using geometric distances alone, CMNC measures the closeness between data points by quantifying the difficulty for one data point to reach another through a chain of neighbors. Furthermore, based on CMNC, we also propose a clustering ensemble framework that combines CMNC and geometric-distance-based closeness measures in order to utilize the advantages of both. In this framework, the “bad data points” that are hard to cluster correctly are identified; then different closeness measures are applied to different types of data points to obtain unified clustering results. With the fusion of different closeness measures, the framework achieves not only better clustering results in complicated clustering tasks, but also higher efficiency.

Highlights

Clustering is an important topic in machine learning, which aims to discover similar data and group them into clusters
We propose the Closeness Measure based on the Neighborhood Chain (CMNC) to address the problems caused by closeness measures based on geometric distance; it achieves good results in clustering tasks with arbitrary cluster shapes and different cluster densities
This paper proposes a novel closeness measure between data points based on the neighborhood chain called CMNC

Summary

Introduction

Clustering is an important topic in machine learning, which aims to discover similar data and group them into clusters. Centroid-based, density-based and connectivity-based methods are the most commonly used in practice (this categorization follows the cluster model each method employs). Well-known clustering methods include k-means [13], DBSCAN [14] and CURE (Clustering Using REpresentatives) [15], which belong to these three categories respectively. Many recent works focus on improving the performance of the classic clustering schemes [16,17,18,19], or on exploiting novel clustering methods that use different closeness measures [20,21,22,23].
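
The abstract describes CMNC as quantifying how difficult it is for one data point to reach another through a chain of neighbors. The formal definition is not given in this text, so the following is only a minimal sketch of one plausible reading of that idea, not the authors' CMNC. It assumes that "difficulty" can be illustrated by the smallest k such that the target is reachable from the start by repeatedly stepping to one of the current point's k nearest neighbors. The function names `neighbor_ranks` and `min_k_to_reach` and the toy data are hypothetical, and the points are assumed to be distinct.

```python
from collections import deque

import numpy as np


def neighbor_ranks(X):
    """rank[i, j] = position of j in i's neighbor list sorted by Euclidean
    distance (the point itself has rank 0). Assumes distinct points."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    order = np.argsort(dist, axis=1, kind="stable")
    n = len(X)
    rank = np.empty((n, n), dtype=int)
    rank[np.arange(n)[:, None], order] = np.arange(n)[None, :]
    return rank


def min_k_to_reach(X, a, b):
    """Smallest k such that b can be reached from a by a chain of steps,
    each step going to one of the current point's k nearest neighbors.
    This is an illustrative stand-in (assumption), not the paper's measure."""
    if a == b:
        return 0
    rank = neighbor_ranks(X)
    n = len(X)
    for k in range(1, n):
        # Breadth-first search over directed k-nearest-neighbor edges.
        seen = {a}
        queue = deque([a])
        while queue:
            i = queue.popleft()
            for j in np.flatnonzero(rank[i] <= k):
                j = int(j)
                if j == b:
                    return k
                if j not in seen:
                    seen.add(j)
                    queue.append(j)
    return None  # only possible if X has fewer than two points


# Toy 1-D data: a chain of nearby points plus one isolated point at 10.0.
X = np.array([[0.0], [1.0], [2.1], [3.3], [10.0]])
print(min_k_to_reach(X, 0, 3))  # 2: reachable through the chain 0 -> 2 -> 3
print(min_k_to_reach(X, 0, 4))  # 4: the isolated point is everyone's farthest neighbor
print(min_k_to_reach(X, 4, 0))  # 1: from the isolated point, nearest-neighbor steps lead to 0
```

In this toy example the quantity depends on the local neighborhood structure rather than on geometric distance alone, and it is not symmetric: reaching the isolated point from the chain takes k = 4, while the reverse direction takes k = 1. How CMNC itself handles this is not stated in the text above.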

Methods

Results

Conclusion