Abstract

Distance calculation is straightforward when working with pure categorical or pure numerical data sets. Defining a unified distance to improve the clustering performance for a mixed data set composed of nominal, ordinal, and numerical attributes is very challenging due to the attributes’ different natures. In this study, we proposed a new measure of distance for a mixed-type data set that regards inter-attribute information and intra-attribute information depending on the type of attributes. In this regard, entropy and Jensen–Shannon divergence concepts were used to exploit the inter-attribute information of categorical-categorical and categorical-numerical attributes, respectively. Also, a modified version of Mahalanobis distance was proposed to consider the intra- and inter-attribute information of numerical attributes. We also introduced a unified framework based on mutual information to control attributes’ contribution to distance measurement. The proposed distance in conjunction with spectral clustering was extensively evaluated concerning various categorical, numerical, and mixed-type benchmark data sets, and the results demonstrated the efficacy of the proposed method.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call