Abstract

Clustering analysis becomes challenging when a dataset contains mixed data types, combining categorical (nominal or ordinal scale) and numerical (interval scale) features. Mainstream distance metrics cannot capture the information that categorical data carry about the similarity between observations and cluster centers, which leads to a loss in clustering performance. Various methods have been proposed in the literature to handle mixed data types in clustering. However, each has drawbacks in capturing categorical information about similarity, in adjusting the contribution of categorical information to clustering, or in computational or implementation efficiency. This study proposes a mixed fuzzy C-means clustering method for mixed data types. Two new distance metrics are developed to handle binary and multi-class nominal features. The scaled entropy of each data type is used to set the weight of that data type in the overall similarity metric, which reduces bias because no user-specified weights are required. A comparative numerical study is conducted with twenty real datasets and seven benchmark methods using five cluster validation statistics. The mixed fuzzy C-means clustering performs better than the benchmark methods and is computationally efficient in practice. Because all the code for implementing mixed fuzzy C-means clustering is provided, the proposed method is readily applicable to practical problems.
