Abstract

This research assesses how closely different similarity measures agree when they are used to determine the similarity of objects. For categorical data in particular, similarity measures may behave differently depending on whether the analyzed data are binary, non-binary, or mixed, and on how many attributes there are and how many distinct values each can take. Cluster analysis is an essential technique in unsupervised machine learning. Objects cannot be clustered without first determining the similarities between the objects in the set and then searching for clusters with the greatest internal cohesion and external separation. We therefore need to know which measures behave similarly and which differ, depending on the nature of the analyzed data. To the best of our knowledge, no such research exists in the literature. In this study, we analyzed six different measures on eight different datasets (differing in size and data type). The results show, among other things, which measures take the longest to compute and which are most strongly correlated with the others and can therefore be used interchangeably without fear of losing valuable information about the mutual similarity of objects.
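The idea of checking whether two measures "behave similarly" can be illustrated with a minimal sketch. The abstract does not name the six measures, so the two used here (simple matching and Jaccard) are assumptions chosen only for illustration, as are the toy binary objects and the use of Pearson correlation over all object pairs; this is not the authors' procedure.

```python
from itertools import combinations
from math import sqrt


def simple_matching(a, b):
    """Fraction of attributes on which two objects take the same value."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def jaccard(a, b):
    """Shared 1-valued attributes divided by attributes that are 1 in either object."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either if either else 1.0


def pearson(u, v):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    norm_u = sqrt(sum((x - mean_u) ** 2 for x in u))
    norm_v = sqrt(sum((y - mean_v) ** 2 for y in v))
    return cov / (norm_u * norm_v)


# Hypothetical binary objects (rows) described by four attributes.
objects = [
    (1, 0, 1, 1),
    (1, 1, 1, 0),
    (0, 0, 1, 1),
    (0, 1, 0, 0),
]

# Score every pair of objects with both measures, then correlate the scores.
pairs = list(combinations(objects, 2))
smc_scores = [simple_matching(a, b) for a, b in pairs]
jaccard_scores = [jaccard(a, b) for a, b in pairs]
agreement = pearson(smc_scores, jaccard_scores)

print(f"Correlation between the two measures: {agreement:.3f}")
```

A correlation close to 1 would suggest that, on this data, the two measures rank object pairs almost identically and could plausibly be swapped; a low value would suggest they capture different aspects of similarity.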

