Abstract

Unlike numerical attributes, categorical attributes have no inherent order, which makes measuring the similarity between them more complex, especially in an unsupervised setting. This work therefore presents a new method, the Heterogeneous Graph-based Similarity measure (HGS), to measure the similarity between categorical data for unsupervised learning. To capture the possibly complex relationships hidden among attributes, a heterogeneous weighted graph is constructed from the information extracted from the categorical data. Both objects and attribute values are represented as nodes, and their occurrence and co-occurrence relationships are represented as edges. Based on a derived node-pair graph, three rules are used to iteratively update the similarity scores between object pairs and attribute-value pairs until the scores converge. We also analyze the complexity of HGS and validate its metric properties and convergence. In the experimental validation, five state-of-the-art measures are compared with HGS on 20 UCI datasets and 6 high-dimensional medical datasets, in k-modes clustering, spectral clustering, and similarity search experiments. The results show that, although no measure outperforms all others on every dataset, HGS performs better overall in both clustering and similarity search tasks. Finally, six studies further discuss the convergence, time cost, and parameter sensitivity of HGS, explore its application to imbalanced class distributions, and compare it with variants that use different initialization and graph construction.
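To make the graph description above more concrete, the following is a minimal sketch of how objects and attribute values could be turned into nodes with occurrence and co-occurrence edges. It is an illustration under assumptions, not the authors' implementation: the function name `build_heterogeneous_graph`, the representation of value nodes as `(attribute index, value)` pairs, and the use of raw co-occurrence counts as edge weights are all hypothetical. The three update rules and the node-pair graph are not specified in this summary, so they are not sketched here.

```python
from collections import defaultdict
from itertools import combinations


def build_heterogeneous_graph(data):
    """Build occurrence and co-occurrence edges from categorical rows.

    data: list of rows, each a list of categorical values (one per attribute).
    Returns (occurrence, co_occurrence), where
      occurrence is a set of (object_id, value_node) edges, and
      co_occurrence maps a frozenset of two value nodes to a count.
    Using counts as weights is an assumption made for illustration.
    """
    occurrence = set()
    co_occurrence = defaultdict(int)
    for obj_id, row in enumerate(data):
        # A value node is identified by its attribute index and its value,
        # so equal strings under different attributes stay distinct.
        value_nodes = [(attr, val) for attr, val in enumerate(row)]
        for node in value_nodes:
            occurrence.add((obj_id, node))
        # Two values co-occur when they describe the same object.
        for u, v in combinations(value_nodes, 2):
            co_occurrence[frozenset((u, v))] += 1
    return occurrence, dict(co_occurrence)


# Toy data: four objects described by two categorical attributes.
data = [
    ["red", "small"],
    ["red", "large"],
    ["blue", "small"],
    ["red", "small"],
]
occurrence, co_occurrence = build_heterogeneous_graph(data)
```

In this toy example, each of the four objects is linked to its two attribute values, and the pair ("red", "small") receives a larger co-occurrence weight than the other pairs because it describes two objects.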

Highlights

  • With the continuous increase of data produced by media, medical, and social network sources, finding the relationships between objects has caught the attention of researchers

  • Clustering results: the experimental results from spectral clustering and k-modes clustering, based on the six compared measures and conducted on 26 datasets, are shown and discussed

  • Spectral clustering experiments were conducted on 26 datasets based on Hamming, Occurrence Frequency (OF), Lin, ALGO, Couple Metric Similarity (CMS), and our proposed Heterogeneous Graph-based Similarity measure (HGS)


Summary

Introduction

With the continuous increase of data produced by media, medical, and social network sources, finding the relationships between objects has caught the attention of researchers. Similarity, as an important relationship, is a numerical measure of the degree to which two data objects are alike, and is usually described as a distance whose dimensions represent features of the objects [1]. Various attributes can be used to determine the similarity between objects. When objects are described by numerical attributes, it is natural to compare them using mature methods such as the Euclidean and Minkowski distances [6]. When they are described by categorical (nominal) attributes, the similarity analysis is much more complex, because the values are unordered and may even be incomparable [7]. It is therefore difficult to quantify the difference between categorical objects in a straightforward way.
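The contrast drawn above can be illustrated with a short sketch. It shows the Minkowski distance (of which the Euclidean distance is the case p = 2) for numerical objects, next to a simple matching similarity for categorical objects in the spirit of a Hamming-style baseline. The function names and toy data are assumptions made for illustration; this is not the HGS measure.

```python
def minkowski(x, y, p=2):
    """Minkowski distance between two numerical vectors (p=2 is Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def matching_similarity(x, y):
    """Fraction of attributes on which two categorical objects agree.

    Values are only tested for equality, because categorical values
    have no order and no meaningful difference.
    """
    matches = sum(a == b for a, b in zip(x, y))
    return matches / len(x)


# Numerical objects: differences between values are meaningful.
euclidean = minkowski([0, 0], [3, 4], p=2)
manhattan = minkowski([0, 0], [3, 4], p=1)

# Categorical objects: only match or mismatch is available per attribute.
sim = matching_similarity(["red", "small", "round"],
                          ["red", "large", "round"])
```

A matching-based measure treats every mismatch identically, so "small" versus "large" counts the same as any other disagreement. This is the kind of information loss that motivates measures that exploit relationships among attribute values.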

