Abstract

In clustering algorithm research, the objects, attributes and other aspects of a data set are usually assumed to be independent and identically distributed (IID); that is, each object is assumed to be drawn independently from the same distribution, with no influence between objects. However, real-world objects are often neither independent nor identically distributed; that is, they are non-IID: objects interact with each other through complex coupling relationships. Clustering results obtained under the IID assumption may therefore be incomplete or even misleading. To make the results of the DBSCAN algorithm as accurate as possible, this paper proposes an improved numerical DBSCAN algorithm based on non-IIDness learning. The algorithm calculates the coupling relationships between objects to uncover their latent relationships, and determines the parameters Eps and MinPts from the distribution characteristics of the data. Experiments on large-scale real and synthetic data sets show that the algorithm achieves higher accuracy than the original DBSCAN algorithm and the main algorithms that improve upon it.

Highlights

  • Clustering refers to the grouping of abstract or physical objects in accordance with the principle that objects in groups should be as similar to each other as possible and that objects in different groups be as different as possible under the condition that samples are not marked; the ultimate purpose of clustering is to discover the natural structure of data [1]

  • For numerical data, we propose the NDBSCAN algorithm, an improvement of the DBSCAN algorithm for data that are not independent and identically distributed (non-IID). This algorithm focuses on the non-IID characteristics of each data point in the data set and uses the principle of coupling similarity to quantify the relationships between data points

  • Proposed algorithm: to cluster numerical data effectively, we propose an improved DBSCAN algorithm based on the idea of non-independent and identically distributed (non-IID) data, namely, the NDBSCAN algorithm


Summary

INTRODUCTION

Clustering refers to the grouping of abstract or physical objects in accordance with the principle that objects in the same group should be as similar to each other as possible and that objects in different groups should be as different as possible, under the condition that samples are not labeled; the ultimate purpose of clustering is to discover the natural structure of data [1].

Y. Wang et al.: Improved Numerical DBSCAN Algorithm Based on Non-IIDness Learning

[...] adjacent dense grids to form clusters. The AA-DBSCAN algorithm [27] uses a new tree structure based on a quadtree to define the density layers of the data set. This method allows AFM to find clusters of different densities more accurately. Apart from the non-IID DBSCAN algorithm, other improvements to the accuracy of the DBSCAN algorithm assume independent and identically distributed data. This assumption ignores the internal relations between data points.
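For orientation, the baseline DBSCAN procedure that these improvements start from can be sketched as follows. This is a minimal illustration of the standard algorithm using plain Euclidean distance, which corresponds to the IID-style treatment discussed above; it is not the coupling-based similarity or the parameter selection of NDBSCAN. The names `dbscan`, `region_query`, `eps` and `min_pts` are illustrative choices, not identifiers from the paper.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1        # label for points that belong to no cluster
UNVISITED = None  # label for points not yet processed


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not a core point; may later be relabeled as a border point.
            labels[i] = NOISE
            continue
        # points[i] is a core point: start a new cluster and expand it.
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                # points[j] is also a core point: keep expanding through it.
                seeds.extend(j_neighbors)
    return labels
```

Both `eps` and `min_pts` are supplied by hand here, and every pairwise distance is computed independently of the other objects. These are the two aspects the abstract says the proposed algorithm addresses.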

REASONS TO IMPROVE THE COUPLING PROCESS IN DBSCAN
THE OPTIMIZED COUPLING ATTRIBUTE ANALYSIS OF NUMERICAL DATA
SELECTING THE EPS PARAMETER
SELECTING THE MinPts PARAMETER
EXPERIMENTAL RESULTS AND ANALYSIS
3: Calculate the quadratic power of each attribute
COMPLEXITY ANALYSIS
CONCLUSION