Abstract

As a core step in clustering analysis, distance measurement directly influences clustering accuracy. Most existing measurement methods rely on cluster feature information. However, these features can be insufficient and can lose data information when clusters contain many objects. To improve measurement accuracy, we make full use of the distribution characteristics of the objects in clusters: we use descriptive statistics and the Wilcoxon-Mann-Whitney rank sum test from nonparametric statistics to measure distances during clustering. Furthermore, we propose a two-stage clustering algorithm to improve clustering performance. With the proposed distance measurement method, the algorithm does not require the number of clusters to be assumed in advance, can discover clusters with arbitrary shapes, and improves clustering accuracy. Experiments on multiple datasets, with comparisons against other clustering algorithms, illustrate the accuracy and efficiency of the proposed algorithm.

Highlights

  • As a basic data mining strategy, clustering analysis is significant for discovering the characteristics of data aggregation, which is an unsupervised process [1]–[3]

  • When the data distribution is unknown, the clustering method is effective at obtaining the inherent distribution of data [4]–[6]

  • There are different ways to obtain data groups, such as the partitioning, hierarchical, and density-based clustering methods

The associate editor coordinating the review of this manuscript and approving it for publication was Zhaojun Li.


Summary

INTRODUCTION

As a basic data mining strategy, clustering analysis is an unsupervised process that is significant for discovering the characteristics of data aggregation [1]–[3]. Reference [27] defined a core set to measure distances using the BIRCH concept. The authors chose a number of objects as representative cluster information, which was insufficient and resulted in information loss. Reference [28] obtained the distribution features of clusters based on a probability density function. If the two sets represented by two clusters are from the same population, they can be grouped into one cluster. Through this method, we can preserve the original cluster information features, analyze the dissimilarity between clusters directly based on the distribution features of their data, and determine whether to merge them into one cluster without a hypothesis about the overall distribution form. An experiment on a real dataset illustrates the practicability of the proposed method and further shows that the method can improve the reliability of obtaining the inherent distribution of data.
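The merge criterion described above (two clusters are grouped together when their objects appear to come from the same population) can be illustrated with a minimal sketch of the Wilcoxon-Mann-Whitney rank sum test. This is not the authors' implementation. It assumes one-dimensional samples (for example, a single feature of each cluster), a two-sided test using the normal approximation without continuity correction, and a significance level of 0.05. The function names `average_ranks`, `rank_sum_test`, and `should_merge` are hypothetical, and how the paper combines several dimensions or the descriptive statistics is not covered here.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(a, b):
    """Two-sided rank sum test (normal approximation, tie-corrected variance).

    Returns (U statistic of sample a, p-value).
    """
    n1, n2 = len(a), len(b)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = list(a) + list(b)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    n = n1 + n2
    # Tie correction: sum of (t^3 - t) over groups of equal values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # all pooled values identical: no evidence of a difference
        return u1, 1.0
    z = (u1 - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, p_value


def should_merge(a, b, alpha=0.05):
    """Merge when the test does not reject 'same population' (alpha is assumed)."""
    _, p_value = rank_sum_test(a, b)
    return p_value >= alpha


if __name__ == "__main__":
    interleaved_a = [1, 3, 5, 7, 9, 11, 13, 15]
    interleaved_b = [2, 4, 6, 8, 10, 12, 14, 16]
    shifted_b = [21, 22, 23, 24, 25, 26, 27, 28]
    print(should_merge(interleaved_a, interleaved_b))
    print(should_merge(interleaved_a, shifted_b))
```

Because the test works only on ranks, it makes no assumption about the form of the underlying distribution, which matches the stated goal of deciding on a merge without a hypothesis about the overall distribution. For small clusters an exact test would be more appropriate than the normal approximation used in this sketch.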

DISTANCE MEASUREMENT BASED ON NONPARAMETRIC STATISTICS
EXPERIMENTS
Findings
TWO-DIMENSIONAL DATASETS
