A Clustering Analysis Method With High Reliability Based on Wilcoxon-Mann-Whitney Testing

Yuan Cheng, Weinan Jia, Ao Li, Ronghua Chi

doi:10.1109/access.2021.3053244

Abstract

As a core step in clustering analysis, distance measurement can influence clustering accuracy. Existing measurement methods are mostly based on cluster feature information. However, these cluster features may be insufficient and can lose data information for clusters containing a large number of objects. To improve measurement accuracy, we make full use of the distribution characteristics of objects in clusters, i.e., we use descriptive statistics and the Wilcoxon-Mann-Whitney rank sum test from nonparametric statistics to measure distances during clustering. Furthermore, we propose a two-stage clustering algorithm to improve clustering analysis performance. Without requiring the number of clusters to be assumed in advance, and using the proposed distance measurement method, the clustering algorithm can discover clusters with arbitrary shapes and improve clustering accuracy. Experiments on multiple datasets, in comparison with other clustering algorithms, illustrate the accuracy and efficiency of the proposed clustering algorithm.

Highlights

As a basic data mining strategy, clustering analysis is significant for discovering the characteristics of data aggregation, which is an unsupervised process [1]–[3].
When the data distribution is unknown, the clustering method is effective at obtaining the inherent distribution of data [4]–[6].
There are different ways to obtain data groups, such as the partitioning clustering method, hierarchical clustering method, density-based clustering method, ...

The associate editor coordinating the review of this manuscript and approving it for publication was Zhaojun Li.

Summary

INTRODUCTION

As a basic data mining strategy, clustering analysis is significant for discovering the characteristics of data aggregation, which is an unsupervised process [1]–[3]. Reference [27] defined a core set to measure distances using the BIRCH concept. They chose a number of objects as representative cluster information; this was insufficient and resulted in information loss. Reference [28] obtained the distribution features of clusters based on a probability density function. If two sets represented by two clusters are from the same population, they can be grouped into one cluster. Through this method, we can retain the original cluster information features, analyze the dissimilarity between clusters directly based on the distribution features of their data, and determine whether to merge them into one cluster without a hypothesis about the overall distribution form. An experiment on a real dataset illustrates the practicability of the proposed method and further shows that this method can improve the reliability of obtaining the inherent distribution of data.
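The merge criterion described above (two clusters are grouped when their objects appear to come from the same population) can be sketched with a rank sum test. The following is a minimal illustration only, not the authors' implementation: it assumes one-dimensional feature values, uses the standard normal approximation with tie correction and no continuity correction, and the names `rank_with_ties`, `wmw_p_value`, `should_merge`, and the significance level `alpha = 0.05` are hypothetical choices made for this example.

```python
import math
from typing import List, Sequence


def rank_with_ties(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wmw_p_value(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided p-value of the Wilcoxon-Mann-Whitney test (normal approximation)."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    pooled = list(a) + list(b)
    ranks = rank_with_ties(pooled)

    # U statistic of the first sample, from its rank sum in the pooled data.
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2

    # Variance of U with the usual correction for tied values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return 1.0  # all values identical: no evidence of a difference

    z = (u1 - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def should_merge(a: Sequence[float], b: Sequence[float], alpha: float = 0.05) -> bool:
    """Merge two clusters when the test does not reject 'same population'."""
    return wmw_p_value(a, b) >= alpha


if __name__ == "__main__":
    interleaved_a = [1, 3, 5, 7, 9, 11, 13, 15]
    interleaved_b = [2, 4, 6, 8, 10, 12, 14, 16]
    far_a = [1.0, 1.2, 0.9, 1.1, 1.05, 0.95, 1.15, 0.85]
    far_b = [5.0, 5.2, 4.9, 5.1, 5.05, 4.95, 5.15, 4.85]
    print(should_merge(interleaved_a, interleaved_b))
    print(should_merge(far_a, far_b))
```

Because the test works on ranks rather than raw values, the decision does not depend on an assumed distribution form, which matches the motivation given in the summary. For multi-dimensional objects, one possible extension (again an assumption, not taken from the source) would be to apply the test per feature and combine the outcomes.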

DISTANCE MEASUREMENT BASED ON NONPARAMETRIC STATISTICS

EXPERIMENTS

Findings

TWO-DIMENSIONAL DATASETS


Journal: IEEE Access	Publication Date: Jan 1, 2021
Citations: 24	License type: CC BY 4.0


