Abstract

Clustering is an unsupervised learning technique that groups unlabeled data according to the proximity between data points, so the performance of a clustering technique depends mainly on the proximity measure it uses. Computing dissimilarity is challenging in high-dimensional and noisy datasets, and in datasets with imbalanced feature scales, all of which appear in various applications. To counter these challenges, we propose a new distance metric that computes the dissimilarity between data points by combining the ensemble properties, entropy, and weight information of the feature vectors. We use the statistical information and entropy along each feature to compute the dissimilarity between points, and then associate each feature with a weight based on its distribution information. The proposed Similarity measure based on Entropy for Numerical Datasets (SEND) is free of domain-specific parameters and makes no underlying assumptions about the distribution of the data. We apply the proposed metric to different types of clustering techniques to evaluate its performance. Experimental analyses on synthetic and real datasets demonstrate the efficacy of the proposed metric in terms of cluster quality, accuracy, execution time, robustness against noise, and the ability to handle high-dimensional datasets.
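The abstract does not give the SEND formula, so the following is only a minimal sketch of the general idea of an entropy-weighted, scale-normalized distance, not the authors' metric. Every name in it (`feature_entropy`, `entropy_weights`, `feature_scale`, `weighted_distance`), the histogram-based entropy estimate, the `bins` parameter, and the choice to give lower-entropy features larger weights are assumptions made for illustration.

```python
import numpy as np

def feature_entropy(column, bins=10):
    """Shannon entropy (in nats) of one feature, estimated from a histogram."""
    counts, _ = np.histogram(column, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def entropy_weights(X, bins=10):
    """Per-feature weights derived from entropy; the weights sum to 1.

    ASSUMPTION: features with lower entropy (more concentrated values)
    receive larger weights. The paper may define its weights differently.
    """
    n_features = X.shape[1]
    H = np.array([feature_entropy(X[:, j], bins) for j in range(n_features)])
    # log(bins) is the largest entropy a histogram with `bins` bins can have.
    raw = np.clip(1.0 - H / np.log(bins), 0.0, None)
    if raw.sum() == 0.0:
        # Every feature looks uniform: fall back to equal weights.
        return np.full(n_features, 1.0 / n_features)
    return raw / raw.sum()

def feature_scale(X):
    """Per-feature range, used to offset imbalanced feature scales."""
    scale = X.max(axis=0) - X.min(axis=0)
    scale[scale == 0] = 1.0  # constant features contribute zero difference
    return scale

def weighted_distance(x, y, weights, scale):
    """Weighted Euclidean distance on range-normalized features."""
    diff = (x - y) / scale
    return float(np.sqrt(np.sum(weights * diff ** 2)))

# Usage: two features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0.0, 1.0, 200), rng.normal(0.0, 1000.0, 200)])
w = entropy_weights(X)
s = feature_scale(X)
d = weighted_distance(X[0], X[1], w, s)
```

Because the weights and scales are computed from the data itself, the sketch has no domain-specific parameter other than the illustrative `bins`, which the metric described in the abstract does not necessarily require.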

