Background: Traditional clustering techniques are typically restricted to either continuous or categorical variables. However, most real-world clinical data are mixed type. This study introduces a clustering technique designed specifically for datasets containing both continuous and categorical variables, with the aim of offering better clustering compatibility, adaptability, and interpretability than other mixed-type techniques.

Methods: This paper proposes a modified Gower distance that incorporates feature importance as weights to maintain equal contributions between continuous and categorical features. The algorithm (DAFI) was evaluated on five simulated datasets with varying proportions of important features and on real-world datasets from the 2011–2014 National Health and Nutrition Examination Survey (NHANES). Its effectiveness was assessed through comparisons with 13 clustering techniques. Clustering performance was measured with the adjusted Rand index (ARI) for accuracy in the simulation studies and with the silhouette score for cohesion and separation in NHANES. Additionally, multivariable logistic regression was used to estimate the association between periodontitis (PD) and cardiovascular diseases (CVDs), adjusting for clusters in NHANES.

Results: In the simulation studies, the DAFI-Gower algorithm consistently outperformed the baseline methods on the ARI in the settings investigated, especially on datasets with more redundant features. In NHANES, 3,760 participants were analyzed. DAFI-Gower achieved the highest silhouette score (0.79). Four distinct clusters with diverse health profiles were identified. With feature importance incorporated, cluster formation was more strongly influenced by CVD-related factors. After adjusting for clusters, periodontitis was significantly associated with cardiovascular diseases (adjusted OR 1.95, 95% CI 1.50 to 2.55, p = 0.012), highlighting severe periodontitis as a potential risk factor for cardiovascular diseases.

Conclusions: DAFI performed better than classic clustering baselines on both simulated and real-world datasets. It captures cluster characteristics effectively by considering feature importance, which is crucial in clinical settings where many variables may be similar or irrelevant. We envisage that DAFI offers an effective solution for mixed-type clustering.
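
To make the idea of a feature-weighted Gower distance concrete, the following is a minimal sketch of a generic weighted Gower distance for mixed-type data. It is not the authors' DAFI implementation: the abstract does not say how the feature-importance weights are derived, so the function name `weighted_gower` and the weight vectors `w_num` and `w_cat` are hypothetical inputs chosen for illustration.

```python
import numpy as np

def weighted_gower(X_num, X_cat, w_num, w_cat):
    """Pairwise feature-weighted Gower distance (illustrative sketch only).

    X_num: (n, p) continuous features; X_cat: (n, q) categorical features.
    w_num, w_cat: non-negative per-feature weights (assumed given, e.g. from
    some feature-importance estimate; their derivation is not shown here).
    """
    X_num = np.asarray(X_num, dtype=float)
    X_cat = np.asarray(X_cat)
    w_num = np.asarray(w_num, dtype=float)
    w_cat = np.asarray(w_cat, dtype=float)

    # Continuous part: absolute difference scaled by each feature's range,
    # so every per-feature distance lies in [0, 1].
    ranges = X_num.max(axis=0) - X_num.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant features contribute zero distance
    d_num = np.abs(X_num[:, None, :] - X_num[None, :, :]) / ranges

    # Categorical part: simple mismatch (0 if equal, 1 otherwise).
    d_cat = (X_cat[:, None, :] != X_cat[None, :, :]).astype(float)

    # Weighted average over all features.
    total = d_num @ w_num + d_cat @ w_cat
    return total / (w_num.sum() + w_cat.sum())

# Toy data: two continuous features, one categorical feature.
X_num = [[0.0, 10.0], [1.0, 20.0], [0.5, 10.0]]
X_cat = [["a"], ["b"], ["a"]]
# Hypothetical weights: the second continuous feature is treated as irrelevant.
D = weighted_gower(X_num, X_cat, w_num=[0.5, 0.0], w_cat=[0.5])
print(D)
```

A matrix like `D` can be passed to any clustering routine that accepts precomputed dissimilarities. Setting a weight to zero removes that feature's influence entirely, which is one simple way that redundant variables could be down-weighted.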